Monthly Archives: September 2014

When to ANALYZE in Greenplum and HAWQ?

Table statistics give the cost-based optimizer the information it needs to build the best query plan possible, and this information is gathered with the ANALYZE command. But when should you execute an ANALYZE on a table? Don’t Greenplum and HAWQ do this automatically?

Greenplum and HAWQ will perform an ANALYZE automatically for you so that the query optimizer has good statistics and builds a good plan. Both databases also let you configure this behavior with gp_autostats_mode and gp_autostats_on_change_threshold.
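You can check the current values from any session with the standard PostgreSQL SHOW command, which both Greenplum and HAWQ support:

SHOW gp_autostats_mode;
SHOW gp_autostats_on_change_threshold;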

gp_autostats_mode
This specifies when an automatic ANALYZE should take place. The values for this configuration parameter can be:
none
on_change
on_no_stats

on_no_stats
The default is on_no_stats: if you INSERT data into an empty table (a new table or one you just did a TRUNCATE on), the database will automatically gather statistics with an ANALYZE.

Typically in an Analytics Data Warehouse, you will insert data once and then read it many times. So the default of on_no_stats will automatically give you statistics as needed with no additional work from you.

none
Self-explanatory: you have to execute ANALYZE yourself if you want statistics.

on_change
With this setting, if you perform an INSERT, UPDATE, or DELETE that exceeds the gp_autostats_on_change_threshold value, an automatic ANALYZE will happen. Note that in HAWQ, you can only INSERT.

In Greenplum, in the rare case where you are doing a DELETE or UPDATE on a table, you will need to execute an ANALYZE statement yourself. Alternatively, you can set the database to use on_change and the ANALYZE will happen automatically.

gp_autostats_on_change_threshold
This is only relevant for on_change, and it is the number of rows that need to change before an automatic ANALYZE will happen. The default is 2147483647 (over 2 billion rows), so if you really want to use on_change, you will also need to reduce this configuration parameter to something more realistic.
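If you decide to change these settings cluster-wide, a minimal sketch using the gpconfig utility looks like this (run as gpadmin on the master; gpstop -u reloads the configuration without restarting the cluster):

# Show the current values
gpconfig -s gp_autostats_mode
gpconfig -s gp_autostats_on_change_threshold

# Switch to on_change with a more realistic threshold of 1 million rows
gpconfig -c gp_autostats_mode -v on_change
gpconfig -c gp_autostats_on_change_threshold -v 1000000

# Reload the configuration on all segments
gpstop -u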

Examples

Example 1 – The default
gp_autostats_mode = on_no_stats
gp_autostats_on_change_threshold = 2147483647

CREATE TABLE my_table AS
SELECT state, count(*) AS counter
FROM customer
GROUP BY state
DISTRIBUTED BY (state);
-- Statistics will be gathered automatically.

-- Drop and recreate the table to repeat the test with a plain INSERT.
DROP TABLE my_table;

CREATE TABLE my_table
(state varchar(100),
 counter int)
DISTRIBUTED BY (state);

INSERT INTO my_table
SELECT state, count(*) AS counter
FROM customer
GROUP BY state;
-- Statistics will be gathered automatically.

TRUNCATE my_table;

INSERT INTO my_table
SELECT state, count(*) AS counter
FROM customer
GROUP BY state;
-- Statistics will be gathered automatically.

TRUNCATE my_table;

INSERT INTO my_table
VALUES ('dummy', 0);
-- Statistics will be gathered automatically.

INSERT INTO my_table
SELECT state, count(*) AS counter
FROM customer
GROUP BY state;
-- Statistics won't be updated. The planner will think the table has only 1 row.

ANALYZE my_table;
-- Statistics gathered manually; the planner now sees the correct number of rows.

Example 2 – Using on_change
gp_autostats_mode = on_change
gp_autostats_on_change_threshold = 1000000

CREATE TABLE my_table AS
SELECT state, count(*) AS counter
FROM customer
GROUP BY state
DISTRIBUTED BY (state);
-- Statistics will be gathered automatically only if the number of rows is 1M or more.

-- Drop and recreate the table to repeat the test with a plain INSERT.
DROP TABLE my_table;

CREATE TABLE my_table
(state varchar(100),
 counter int)
DISTRIBUTED BY (state);

INSERT INTO my_table
SELECT state, count(*) AS counter
FROM customer
GROUP BY state;
-- Statistics will be gathered automatically only if the number of rows is 1M or more.

TRUNCATE my_table;

INSERT INTO my_table
SELECT state, count(*) AS counter
FROM customer
GROUP BY state;
-- Statistics will be gathered automatically only if the number of rows is 1M or more.

TRUNCATE my_table;

INSERT INTO my_table
VALUES ('dummy', 0);
-- Statistics will not be gathered automatically.

INSERT INTO my_table
SELECT state, count(*) AS counter
FROM customer
GROUP BY state;
-- Statistics will be gathered automatically only if the number of rows is 1M or more.

Checking the Statistics
Here are a couple of queries you can use to see the statistics information.

SELECT c.oid, c.relpages, c.reltuples 
FROM pg_class c 
JOIN pg_namespace n ON c.relnamespace = n.oid 
WHERE n.nspname = 'public' 
AND c.relname = 'my_table';

-- Using the oid from the previous query:
SELECT * FROM pg_statistic WHERE starelid = 298610;
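
If you prefer to skip the oid lookup, you can cast the table name to regclass, and the pg_stats view gives a more readable per-column summary (both are standard PostgreSQL catalog features that Greenplum inherits):

SELECT relpages, reltuples
FROM pg_class
WHERE oid = 'public.my_table'::regclass;

SELECT attname, null_frac, n_distinct, most_common_vals
FROM pg_stats
WHERE schemaname = 'public'
AND tablename = 'my_table';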

Summary
Greenplum and HAWQ automatically gather statistics for you in most cases. If you are doing lots of DML activity, you can change the configuration so that statistics are still gathered automatically. These automatic settings make life easier for DBAs, Developers, and Analysts but still give you the flexibility to configure statistics collection in the best way for your environment.

Greenplum Database Incremental Backup

Greenplum Database has a feature called “incremental” backup, but really it is an incremental differential backup. In this post, I will explain how it works and give you a working example.

AO Tables
In version 4.2 and earlier, AO tables are Append Only, but starting in 4.3, they are Append Optimized. Everyone using 4.3 should be using AO tables as they use less space, support this incremental feature, and execute faster. Incremental backup requires AO tables.

Example:

CREATE TABLE foo (id integer, bar text) WITH (appendonly=true) DISTRIBUTED BY (id);

Using the incremental option on a regular heap table will not make any difference: every backup will still include all heap tables in full.

Demo

First create a database.

gpdb=# drop database poc;
DROP DATABASE
gpdb=# CREATE DATABASE poc;
CREATE DATABASE
gpdb=# \c poc
You are now connected to database "poc" as user "gpadmin".

Now create an AO table and insert some data.

poc=# CREATE TABLE foo (id integer, bar text) WITH (appendonly=true) DISTRIBUTED BY (id);
CREATE TABLE
poc=# INSERT INTO foo VALUES (1, 'bar_1');
INSERT 0 1
poc=# INSERT INTO foo VALUES (2, 'bar_2');
INSERT 0 1
poc=# 

I’m next going to execute a full backup. For the demo, I’m going to turn off compression so it is easier to view the backup files.

I’m using gpcrondump with these switches:
-x specifies the database to backup
-a means automatic and don’t prompt
-z turns off compression

bigmac:~ gpadmin$ gpcrondump -x poc -a -z
20140909:13:39:19:006156 gpcrondump:bigmac:gpadmin-[INFO]:-Starting gpcrondump with args: -x poc -a -z
...

Just to see what is in the backup file, I’m going to cat one of the segment’s backup files. It shows one of the two records in this tiny table.

bigmac:~ gpadmin$ cat /data/primary/gpseg0/db_dumps/20140909/gp_dump_0_2_20140909133919 
--
-- Greenplum Database database dump
--

SET statement_timeout = 0;
SET client_encoding = 'UTF8';
SET standard_conforming_strings = off;
SET check_function_bodies = false;
SET client_min_messages = warning;
SET escape_string_warning = off;

SET search_path = public, pg_catalog;

SET default_with_oids = false;

--
-- Data for Name: foo; Type: TABLE DATA; Schema: public; Owner: gpadmin
--

COPY foo (id, bar) FROM stdin;
1	bar_1
\.


--
-- Greenplum Database database dump complete
--

Now I immediately run an incremental backup. Nothing changed in my database so it should be an empty backup set.

bigmac:~ gpadmin$ gpcrondump -x poc -a -z --incremental
20140909:13:41:19:006395 gpcrondump:bigmac:gpadmin-[INFO]:-Starting gpcrondump with args: -x poc -a -z --incremental
...

To verify:

bigmac:~ gpadmin$ cat /data/primary/gpseg0/db_dumps/20140909/gp_dump_0_2_20140909134119 
--
-- Greenplum Database database dump
--

SET statement_timeout = 0;
SET client_encoding = 'UTF8';
SET standard_conforming_strings = off;
SET check_function_bodies = false;
SET client_min_messages = warning;
SET escape_string_warning = off;

--
-- Greenplum Database database dump complete
--

So my incremental backup doesn’t have any data in it. If I had deleted data from or inserted data into my one table, the entire table would be part of the backup set. That is why it is an incremental differential backup: if there is any difference in an AO table (or partition), the entire table/partition will be included in the incremental backup.
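To see that behavior for yourself, a quick follow-on to the demo (not shown above) is to change the table and take another incremental backup; the new segment dump file, named with the new timestamp, should then contain every row of foo stored on that segment, not just the newly inserted one.

bigmac:~ gpadmin$ psql poc -c "INSERT INTO foo VALUES (3, 'bar_3');"
INSERT 0 1
bigmac:~ gpadmin$ gpcrondump -x poc -a -z --incremental
...
bigmac:~ gpadmin$ cat /data/primary/gpseg0/db_dumps/20140909/gp_dump_0_2_<new timestamp>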

Restore
So now that I have one full backup and one incremental, let’s drop the table and then restore it.

bigmac:~ gpadmin$ psql poc
psql (8.2.15)
Type "help" for help.

poc=# DROP TABLE foo;
DROP TABLE
poc=# \q

And now the restore using gpdbrestore. I’m using the following switches:
-a means automatic and don’t prompt
-e will drop the target database and then recreate it
-s uses the latest backup set

bigmac:~ gpadmin$ gpdbrestore -a -e -s poc
20140909:13:42:58:006720 gpdbrestore:bigmac:gpadmin-[INFO]:-Starting gpdbrestore with args: -a -e -s poc
...

Verify the restore worked:

bigmac:~ gpadmin$ psql poc
psql (8.2.15)
Type "help" for help.

poc=# SELECT * FROM foo;
 id |  bar  
----+-------
  1 | bar_1
  2 | bar_2
(2 rows)

poc=# 

Conclusion
Both commands used here, gpcrondump and gpdbrestore, have many more options for backup and restore, respectively. Using the incremental switch on gpcrondump will save you disk space and execution time while still giving you the ability to do a full system restore or even a selective table restore (sketched below).
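
As a sketch of that last point, restoring just one table from the most recent backup set might look like the following (the -T switch takes a schema-qualified table name; check gpdbrestore --help for the options available in your version):

bigmac:~ gpadmin$ gpdbrestore -a -s poc -T public.foo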

Hadoop Distributions

Today’s landscape of Hadoop vendors is mostly made up of privately held companies with big investments and partnerships with well-established companies. This is very similar to the MPP marketplace in 2010, when those companies were being purchased by much larger companies.

Here are a few of the MPP database vendors with their buyers:
Greenplum => EMC
Netezza => IBM
DATAllegro => Microsoft
Aster Data => Teradata
Vertica => HP

Similarly, I think the landscape of Hadoop vendors will change in the near future. Here are the major vendors in this Hadoop space as of September 2014:

Cloudera

  • Private
  • Investments: 2011 – $40M; 2014 – $900M
  • Around 600 employees
  • Founded in 2009
  • Partners with Oracle, Intel (funding), and Amazon (but also competes with Amazon)

Hortonworks

  • Private
  • Investments: 2011 – $23M + $25M
  • 201-500 employees
  • Founded in 2011
  • Partners with Yahoo, Teradata, and SAP

IBM

  • Public
  • $100B Revenue / year
  • 400K employees
  • Founded in 1911

MapR

  • Private
  • Investments: 2009 – $9M; 2014 – $110M
  • 201-500 employees
  • Founded in 2009
  • Partners with Google

Pivotal

  • Private
  • Investments: 2013 – $100M from GE and assets from EMC and VMWare
  • 3000+ employees
  • Founded in 2013 (Pivotal), 2003 (Greenplum), 1998 (VMWare) and 1979 (EMC)
  • Partners with EMC, VMWare, and GE

Amazon

  • Public
  • $75B Revenue / year
  • 132K employees
  • Founded in 1994

Hadoop Vendors Tomorrow
Cloudera => Oracle or Amazon
It will probably be Oracle because of the existing partnership and the leadership that came from Oracle, but Amazon may want it more. If Oracle doesn’t buy Cloudera, it will probably try to create its own distribution like it did with Linux.

Hortonworks => Teradata
It is only a matter of time before Teradata will have to buy Hortonworks. Microsoft might try to buy Hortonworks or just take a fork of the Windows version to rebrand. Microsoft worked with Sybase on SQL Server a long time ago and then took the code and ran rather than buying Sybase. Because of that history, I think Microsoft won’t buy Hortonworks and Teradata will.

Teradata already bought Aster Data, and Hortonworks would complete its data portfolio: Teradata for the EDW, Aster Data for Data Marts, and Hortonworks for the Data Lake.

MapR => Google
Google will snatch up MapR, which will make MapR very happy.

So that leaves IBM and Amazon as the two publicly held companies. Pivotal is privately held, but by EMC, VMWare, and GE, and EMC’s past actions give every indication that this company will go public and be big.

Post Acquisitions
So after the big shakeup, I think these will be the vendors left selling Hadoop:

  • Pivotal: 100% Apache based with the best SQL Engine
  • IBM: Big Insights
  • Teradata: Hortonworks
  • Oracle: Cloudera
  • Google: MapR
  • Amazon: Elastic MapReduce

I could be wrong, but I really do think there will be a consolidation of vendors in the near future.