Note: This post is for Greenplum only and I will make a subsequent post for HAWQ.
What Is Analyze?
ANALYZE is a pretty simple command that gathers statistics on tables so the cost-based optimizer can make the best plan possible. Greenplum has a mechanism to do this automatically for you, but in some circumstances it won't, so you will have to run it manually.
gp_autostats_mode = on_no_stats
This is the default, which simply means the database will automatically gather stats when data is inserted into a table for the first time. When you do a Create Table As Select (CTAS), it will gather the statistics for you. If you create a table and then INSERT data, it will gather stats for you. But if you INSERT data into a table that already has stats, Greenplum will not gather stats for you again.
CREATE TABLE foo (id int, fname text) DISTRIBUTED BY (id);
INSERT INTO foo VALUES (1, 'jon');
INSERT INTO foo SELECT i, 'jon_' || i FROM generate_series(1,1000000) AS i;
In the above example, the first INSERT statement will cause the database to gather the stats and it will record that foo only has 1 row in it. The next INSERT of 1 million records won’t trigger another ANALYZE. Do you see where you could get into trouble? You need to run ANALYZE on foo.
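Continuing the example, you can fix the stale statistics yourself and then verify the optimizer's row estimate by querying pg_class (reltuples is the planner's estimate, not an exact count):

```sql
-- Gather fresh statistics on foo after the large INSERT.
ANALYZE foo;

-- Check the planner's row estimate; it should now be near 1 million instead of 1.
SELECT relname, reltuples
FROM pg_class
WHERE relname = 'foo';
```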
gp_autostats_mode = on_change
This sounds like a good idea: if you insert more data into a table, it will automatically gather stats for you. But there is another GUC that is important here. gp_autostats_on_change_threshold determines how many rows must be added to a table before an ANALYZE is run automatically again, and the default value is 2,147,483,647!
So in our first example of adding 1 million rows to an existing table, ANALYZE still will not run automatically if we were to set gp_autostats_mode=on_change, because 1 million is far below the default threshold.
If you want to use on_change, I suggest reducing the threshold to a lower number.
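As a sketch, assuming a threshold of 1 million rows suits your workload (the value here is purely illustrative), you could change both GUCs with gpconfig and then reload the configuration:

```shell
# Use on_change mode and lower the threshold cluster-wide (illustrative value).
gpconfig -c gp_autostats_mode -v on_change
gpconfig -c gp_autostats_on_change_threshold -v 1000000

# Reload the configuration without restarting the database.
gpstop -u
```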
gp_autostats_mode = none
This is the last value you can set for this GUC, and it does exactly what it says: nothing. No stats are ever gathered automatically.
Partitioned Tables
This is really important: inserting data into a partitioned table, even an empty one, will never cause an ANALYZE to be executed automatically. The stats will be incorrect, and you will have to run ANALYZE on the table manually.
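Here is a minimal sketch of the pitfall; the table name and partition scheme are made up for illustration:

```sql
-- A partitioned table with one partition per month of 2015.
CREATE TABLE sales (id int, sale_date date)
DISTRIBUTED BY (id)
PARTITION BY RANGE (sale_date)
(START ('2015-01-01'::date) INCLUSIVE
 END ('2016-01-01'::date) EXCLUSIVE
 EVERY (INTERVAL '1 month'));

-- Even though sales is empty, this INSERT does NOT trigger an automatic ANALYZE.
INSERT INTO sales
SELECT i, '2015-01-01'::date + (i % 365)
FROM generate_series(1, 1000000) AS i;

-- So you must gather the stats yourself.
ANALYZE sales;
```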
Orca and the rootpartition
The new query optimizer, code-named Orca, must have statistics on the "rootpartition". This is simply a rollup of the statistics from each partition to the table level.
When enabling Orca, you will need to add this setting:
gpconfig -c optimizer_analyze_root_partition -v on --masteronly
This tells Greenplum to gather the table-level statistics on a table when you ANALYZE it. If you don't do this, the table will appear to be empty to Orca, and you will most certainly get a poor query plan.
You can also run ANALYZE on just the rootpartition.
ANALYZE ROOTPARTITION foo;
Please Make this Easy! analyzedb
We did! There is a new utility called “analyzedb” which tracks what tables and partitions need to be analyzed and does it for you automatically. It also does the work in parallel so it completes the task much, much faster.
You can specify an entire database:
analyzedb -d gpdb
a single schema:
analyzedb -d gpdb -s myschema
or a single table:
analyzedb -d gpdb -t public.foo
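To keep statistics fresh without thinking about it, you could schedule analyzedb from cron; the schedule and database name below are illustrative, and the -a flag (run without prompting for confirmation) is what makes it suitable for unattended use:

```shell
# Illustrative crontab entry: refresh stale stats in gpdb every night at 1 AM.
0 1 * * * analyzedb -d gpdb -a
```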
analyzedb is the program to add to your maintenance scripts to ensure you have correct statistics in the database. It is another example of the continual development and enhancement of Greenplum Database.