This is the fourth in a series of “How do I do X in Greenplum database” and today’s topic is How do I tune Greenplum?
I’ve been a DBA and developer using Oracle, SQL Server, Sybase, PostgreSQL, and Greenplum. By far, “tuning” Greenplum is easier than tuning the other databases, and the main reason is Greenplum’s architecture versus that of an OLTP database. With those other databases, I always found myself looking at the top 20 worst SQL queries and then doing three things. First, I would tell the users that they write “bad” SQL. Second, I would create more and more indexes to handle the variety of queries. Lastly, I would redesign my dimensional model to match how users really wanted to look at the data.
Greenplum is data model agnostic. You can use your OLTP data model directly in the database, a 3NF model, a star schema, or even a very flat single table. Some databases force you into a particular data model, which isn’t agile at all. It even forces waterfall development, because it takes a lot of time to design those star schemas!
So now, my steps to tune Greenplum!
Step 1: Distribution
Greenplum is a shared-nothing database in which no two nodes share the same data. Data is spread across multiple servers based on a distribution key defined on each table. A good key is typically a unique identifier in a table, and it can be a single column or multiple columns. If you pick a good key, each segment will have roughly the same number of rows; an uneven distribution is what we at Greenplum call “skew”.
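For example, distributing a table on its unique identifier might look like this (the table and column names here are just illustrative):

```sql
-- Hash-distribute rows across segments on a high-cardinality column
CREATE TABLE customer (
    customer_id bigint,
    fname       text,
    lname       text,
    create_date date
) DISTRIBUTED BY (customer_id);
```

If a table has no good candidate key, DISTRIBUTED RANDOMLY spreads rows round-robin and avoids skew, at the cost of losing co-located joins on a key.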
Here is how to check the distribution or skew of a table:
SELECT gp_segment_id, count(*) FROM schema.table GROUP BY gp_segment_id;
For most tables, this is all of the tuning that is needed. It is unusual that all of the tables in a database are very large. Usually there are a few large tables along with a large number of medium and small tables.
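If the counts come back very uneven, you can redistribute an existing table on a better key without reloading it. A sketch, using placeholder table and column names:

```sql
-- Redistribute the table on a better key; this rewrites
-- the table's rows across the segments
ALTER TABLE schema.table SET DISTRIBUTED BY (trans_id);

-- Or fall back to round-robin if no natural key exists:
-- ALTER TABLE schema.table SET DISTRIBUTED RANDOMLY;
```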
Step 2: Partitioning
For your larger tables, you will want to partition the tables to reduce disk I/O. Greenplum supports list, range, and multi-level partitioning.
CREATE TABLE sales (
    trans_id int,
    date     date,
    amount   decimal(9,2),
    region   text
)
DISTRIBUTED BY (trans_id)
PARTITION BY RANGE (date)
SUBPARTITION BY LIST (region)
SUBPARTITION TEMPLATE (
    SUBPARTITION usa    VALUES ('usa'),
    SUBPARTITION asia   VALUES ('asia'),
    SUBPARTITION europe VALUES ('europe'),
    DEFAULT SUBPARTITION other_regions
)
(
    START (date '2008-01-01') INCLUSIVE
    END (date '2009-01-01') EXCLUSIVE
    EVERY (INTERVAL '1 month'),
    DEFAULT PARTITION outlying_dates
);
The Greenplum optimizer will eliminate partitions based on the SQL executed. Again, partitioning is for your larger tables and won’t help very much for smaller tables.
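You can verify that elimination is happening by running EXPLAIN on a query that filters on the partition keys; the plan should scan only the matching partitions rather than the whole table. The predicate below is just an example against the sales table defined above:

```sql
-- Only the January 2008 / usa partitions should be scanned
EXPLAIN
SELECT sum(amount)
FROM   sales
WHERE  date >= '2008-01-01'
AND    date <  '2008-02-01'
AND    region = 'usa';
```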
Step 3: Compression and Column Orientation
Greenplum supports both row- and column-oriented tables (and partitions). For your larger tables, and especially the ones with lots of columns, using column orientation is a good idea. It is also a good idea to compress these tables. Both are basically strategies to reduce disk I/O.
Here is an example:
CREATE TABLE my_customer
WITH (APPENDONLY=true, ORIENTATION=column, COMPRESSTYPE=quicklz)
AS
SELECT customer_id, fname, lname, address1, address2,
       city, state, zip, create_date, bill_amount
FROM   customer
WHERE  type_cd = 'xyz'
DISTRIBUTED BY (customer_id);
I only do this for my large tables. It isn’t worth it to do this unless the tables are pretty large.
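To check whether the compression was worth it, Greenplum provides a helper function that reports the ratio achieved for an append-only table (my_customer here refers to the table created above):

```sql
-- Returns the compression ratio for an append-only table;
-- e.g. 3.19 means the data compressed to roughly a third of its raw size
SELECT get_ao_compression_ratio('my_customer');
```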
Step 4: Indexes
Last but not least are indexes. I rarely use an index except for enforcing a Primary Key. Greenplum supports B-Tree, Bitmap, and GiST indexes.
– B-Tree indexes can be unique or not.
– Bitmap indexes are useful when there are 100 to 100,000 distinct values.
– GiST indexes are used to support PostGIS.
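For the rare cases where an index does help, the syntax is standard PostgreSQL with the access method spelled out (the table and column names below are illustrative, reusing the my_customer example):

```sql
-- Unique B-Tree index, typically backing a primary key
CREATE UNIQUE INDEX customer_pk_idx
    ON my_customer (customer_id);

-- Bitmap index on a lower-cardinality column
CREATE INDEX customer_state_idx
    ON my_customer USING bitmap (state);
```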
Again, I rarely use indexes, but there are cases where they are helpful. They are the last item in the list of things to do to “tune” your database.
Tuning in Greenplum doesn’t mean looking at lots of queries; it means checking table distribution. Then, for your larger tables (probably 10% or less of your database), partition, compress, and use column orientation. Finally, for maybe one or two tables, add an index.
Gone are the days of looking at the “bad” queries. No more creating indexes for everything under the sun. No more waterfall data modeling projects either! Just load the data and follow my four tuning steps.