Checking for “Skew”

Shared Nothing / MPP Database Architecture
Greenplum database is a “Shared Nothing” database in which data is spread out across many segments located on multiple segment hosts. No two segments have the same data and data is distributed based on a distribution key.

For example, let’s say you have a customer table in Greenplum with the distribution key of (customer_id). Your Greenplum cluster has 16 segments so your data would actually be physically located like this:

Sement Customer ID
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
10 11
11 12
12 13
13 14
14 15
15 16

When a query is executed like “SELECT COUNT(*) FROM customer”, there will be 16 processes created (one for each segment) where the number of rows are counted in parallel across all segments. This is the design behind a Shared Nothing or Massively Parallel Processing (MPP) database.

So what is skew? In my customer table example, I have even distribution. That is, each segment has the same number of rows. If I distributed instead by state_code and all of my customers are in Tennessee, then all of the data would be in a single segment and thus be skewed.

Having tables that are skewed will slow down processing significantly because it isn’t taking advantage of the entire cluster.

Do I need to Check For Skew?
If you set your distribution key to the logical or physical Primary Key of the table, checking for skew isn’t needed. This will give you great distribution of data.

There are times when you don’t have a PK on a table or you want to set the distribution key to something else because you commonly join two very large tables together. For example, customer and orders tables may perform better if distributed by customer_id so that the data is collocated. However, if the majority of your orders are by one customer, you will have skew in the orders table.

Checking for Skew Method 1
This is the most common and straight forward way to check for skew.

SELECT gp_segment_id, COUNT(*)
FROM customer
GROUP BY gp_segment_id

This will count the rows per segment and give you a good indication of the skew. The downside to this is it physically counts the rows for this one table at at time. This takes time especially if you have a large number of tables.

Checking for Skew Method 2
I developed this second approach which looks at the file sizes for each table for each segment. It then will output only the tables that have at least one segment with more than 20% more bytes than expected.

Execute the following script in your database.

        v_function_name text := 'fn_create_db_files';
        v_location int;
        v_sql text;
        v_db_oid text;
        v_num_segments numeric;
        v_skew_amount numeric;
        v_location := 1000;
        SELECT oid INTO v_db_oid 
        FROM pg_database 
        WHERE datname = current_database();

        v_location := 2000;
        v_sql := 'DROP VIEW IF EXISTS vw_file_skew';

        v_location := 2100;
        EXECUTE v_sql;
        v_location := 2200;
        v_sql := 'DROP EXTERNAL TABLE IF EXISTS db_files';

        v_location := 2300;
        EXECUTE v_sql;

        v_location := 3000;
        v_sql := 'CREATE EXTERNAL WEB TABLE db_files ' ||
                '(segment_id int, relfilenode text, filename text, ' ||
                'size numeric) ' ||
                'execute E''ls -l $GP_SEG_DATADIR/base/' || v_db_oid || 
                ' | ' ||
                'grep gpadmin | ' ||
                E'awk {''''print ENVIRON["GP_SEGMENT_ID"] "\\t" $9 "\\t" ' ||
                'ENVIRON["GP_SEG_DATADIR"] "/' || v_db_oid || 
                E'/" $9 "\\t" $5''''}'' on all ' || 'format ''text''';

        v_location := 3100;
        EXECUTE v_sql;

        v_location := 4000;
        SELECT count(*) INTO v_num_segments 
        FROM gp_segment_configuration 
        WHERE preferred_role = 'p' 
        AND content >= 0;

        v_location := 4100;
        v_skew_amount := 1.2*(1/v_num_segments);
        v_location := 4200;
        v_sql := 'CREATE OR REPLACE VIEW vw_file_skew AS ' ||
                 'SELECT schema_name, ' ||
                 'table_name, ' ||
                 'max(size)/sum(size) as largest_segment_percentage, ' ||
                 'sum(size) as total_size ' ||
                 'FROM	( ' ||
                 'SELECT n.nspname AS schema_name, ' ||
                 '      c.relname AS table_name, ' ||
                 '      sum(db.size) as size ' ||
                 '      FROM db_files db ' ||
                 '      JOIN pg_class c ON ' ||
                 '      split_part(db.relfilenode, ''.'', 1) = c.relfilenode ' ||
                 '      JOIN pg_namespace n ON c.relnamespace = n.oid ' ||
                 '      WHERE c.relkind = ''r'' ' ||
                 '      GROUP BY n.nspname, c.relname, db.segment_id ' ||
                 ') as sub ' ||
                 'GROUP BY schema_name, table_name ' ||
                 'HAVING sum(size) > 0 and max(size)/sum(size) > ' || 
                 v_skew_amount::text || ' ' || 
                 'ORDER BY largest_segment_percentage DESC, schema_name, ' ||

        v_location := 4300;
        EXECUTE v_sql; 

                RAISE EXCEPTION '(%:%:%)', v_function_name, v_location, sqlerrm;
language plpgsql;

Now execute the function so it will create the necessary objects in your database.

SELECT fn_create_db_files();

Now select from the view to see the tables that are skewed.

SELECT * FROM vw_file_skew ORDER BY 3 DESC;

When tuning a database, a Greenplum DBA will typically only need to worry with skew instead of tuning queries like other databases. Once the tables are created with good distribution keys, the tuning job is 90% or more done. If you think one particular table is not distributed well, use Method 1. If you need to validate an entire database quickly, use Method 2.

6 thoughts on “Checking for “Skew”

  1. Danilo Fortunato

    Thanks a lot, Jon.

    The method based on the file size of the tables is really useful if you need to check the presence of data skew on an entire database.

  2. Cecil

    Is there a way to study the data to discover a good choice for a distribution key? Or does GP just do a blind round robin to evenly distribute the data? I think not, since this would end up being similar to a random distribution. So I think it must have hash algorithm that it is using. If I had that hash algorithm I could apply it to see if it distributes evenly or not.

    1. Jon Post author

      The general rule of thumb is to use the logical primary key of the table for the distribution key. For example, with a single Integer column, the hash is very simple where it puts ID=1 in segment0, ID=2 in segment1, ID=3 in segment2, etc.

      Typically, the only time where you want to deviate from this is 1) you are joining two large tables together and you want the joins to be collocated to avoid data movement and 2) you have processing skew which happens when an uneven number of segments process the query.


Leave a Reply

Your email address will not be published. Required fields are marked *