HAWQ is a port of Greenplum database to work with the HDFS file system from Hadoop. It is only available with the Pivotal HD distribution of Hadoop even though Pivotal HD is 100% Apache Hadoop compliant.
This post will focus on the major differences in the filesystem, External Tables, DDL commands, and DML commands. I’m comparing Greenplum database version 4.3 with HAWQ version 18.104.22.168 which is shipped with Pivotal HD version 2.0.1.
HDFS versus Posix Filesystem
With Greenplum database, each segment’s files are always local to the host. Even in a failover condition when the mirror is acting as the primary, the data is local to the processing.
With HAWQ, HDFS handles high availability by having three copies of the data across multiple nodes. Because of this, the mirroring that is built into the Greenplum database is removed from HAWQ. If a node were to fail, then Hadoop automatically creates a third copy of the data. So a segment running on a Hadoop data node may not have the data it needs local to it and will need to get data from other physical nodes.
Both HAWQ and Greenplum database have External Tables but differ when accessing external data in Hadoop.
HAWQ has PXF which is the Pivotal Extension Framework. It has the ability to access files in HDFS stored as plain text but also in Hive, Hbase, Avro, and Gemfire XD. You can write your own custom profiles to get data from HDFS. PXF can also get some statistics about these files so the optimizer is smarter when accessing these External Tables.
Greenplum database doesn’t have PXF but does have GPHDFS. GPHDFS enables Greenplum database to read and write data to HDFS. It doesn’t have built-in capabilities to Avro, Hive, HBase, and Gemfire XD. It also doesn’t have statistics for these External Tables.
HAWQ is great at exploring and transforming data in Hadoop while Greenplum database is great at bulk loading data from Hadoop into the database as well as bulk writing data from Greenplum database into Hadoop. So land all of your data in Hadoop, transform it with SQL and then create data marts in Greenplum database.
Both offer functions but HAWQ doesn’t have SECURITY DEFINER functions yet.
Here is a list of commands that are in Greenplum database but not in HAWQ:
- CREATE AGGREGATE: user defined aggregate like SUM and COUNT.
- CREATE CAST: user defined conversion of two datatypes.
- CREATE CONVERSION: user defined conversion of character set encodings.
- CREATE DOMAIN: user defined datatype with optional constraints.
- CREATE INDEX: indexes aren’t supported in HAWQ.
- CREATE OPERATOR: user defined operator like != is the same as <>.
- CREATE OPERATOR CLASS: user defined class of how a data type is used within an Index.
- CREATE RULE: user defined filter placed on a table or view like “gender_code in (‘M’, ‘F’)”
- CREATE TABLESPACE: user defined directory to be using in Posix filesystem to store database objects.
- CREATE TRIGGER: user defined trigger for a table. Note that this is very limited in Greenplum database.
HDFS is designed for “write once, read many” and can not handle file pruning which is required for DELETE and UPDATE commands. Because of this HAWQ doesn’t support UPDATE and DELETE commands while Greenplum database does.
There are other small differences between the two products but these are the major ones.