HAWQ versus Greenplum Database

HAWQ is a port of the Greenplum database that runs on HDFS, the Hadoop Distributed File System. It is available only with the Pivotal HD distribution of Hadoop, even though Pivotal HD is 100% Apache Hadoop compliant.

This post focuses on the major differences in the filesystem, External Tables, DDL commands, and DML commands. I’m comparing Greenplum database version 4.3 with HAWQ version 1.2.0.1, which ships with Pivotal HD version 2.0.1.

HDFS versus POSIX Filesystem
With Greenplum database, each segment’s files are always local to the host. Even in a failover condition when the mirror is acting as the primary, the data is local to the processing.

With HAWQ, HDFS provides high availability by keeping three copies of the data spread across multiple nodes. Because of this, the mirroring that is built into Greenplum database is removed from HAWQ. If a node fails, Hadoop automatically re-replicates the data to restore three copies. This also means a segment running on a Hadoop data node may not have the data it needs stored locally and will need to fetch it from other physical nodes.

External Tables
Both HAWQ and Greenplum database have External Tables, but they differ in how they access external data in Hadoop.

HAWQ has PXF, the Pivotal Extension Framework. It can read files in HDFS stored as plain text, but it can also access data in Hive, HBase, Avro format, and GemFire XD, and you can write custom profiles to pull other data out of HDFS. PXF can also gather statistics on these files so the optimizer is smarter when accessing these External Tables.
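For example, a HAWQ external table over a delimited text file in HDFS might look like the sketch below. The NameNode host, port, path, and columns are all hypothetical; HdfsTextSimple is one of the built-in PXF profiles.

    -- HAWQ: read a comma-delimited text file from HDFS through PXF.
    CREATE EXTERNAL TABLE pxf_sales (
        sale_id  int,
        amount   numeric
    )
    LOCATION ('pxf://namenode:50070/data/sales?PROFILE=HdfsTextSimple')
    FORMAT 'TEXT' (DELIMITER ',');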

Greenplum database doesn’t have PXF, but it does have GPHDFS, which lets Greenplum database read and write data in HDFS. It has no built-in support for Avro, Hive, HBase, or GemFire XD, and it doesn’t gather statistics for these External Tables.
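For comparison, here is a minimal sketch of GPHDFS external tables in Greenplum database, one readable and one writable (host, port, and paths are again hypothetical):

    -- Greenplum database: read from HDFS.
    CREATE EXTERNAL TABLE gphdfs_sales_in (
        sale_id  int,
        amount   numeric
    )
    LOCATION ('gphdfs://namenode:8020/data/sales/*.txt')
    FORMAT 'TEXT' (DELIMITER ',');

    -- Greenplum database: write back out to HDFS.
    CREATE WRITABLE EXTERNAL TABLE gphdfs_sales_out (LIKE gphdfs_sales_in)
    LOCATION ('gphdfs://namenode:8020/data/sales_out')
    FORMAT 'TEXT' (DELIMITER ',');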

HAWQ is great at exploring and transforming data in Hadoop, while Greenplum database is great at bulk loading data from Hadoop into the database as well as bulk writing data from Greenplum database into Hadoop. So land all of your data in Hadoop, transform it with SQL, and then create data marts in Greenplum database.
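With external tables like the GPHDFS ones above, that bulk load into a data mart is a single statement (sales_mart is a hypothetical target table):

    INSERT INTO sales_mart SELECT * FROM gphdfs_sales_in;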

Functions
Both offer functions, but HAWQ doesn’t support SECURITY DEFINER functions yet.
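For reference, this is what a SECURITY DEFINER function looks like in Greenplum database; the payroll table and function name are hypothetical:

    -- Runs with the privileges of the function owner rather than the caller,
    -- so callers can look up a salary without SELECT rights on payroll.
    CREATE FUNCTION get_salary(emp_id int) RETURNS numeric AS $$
        SELECT salary FROM payroll WHERE id = $1;
    $$ LANGUAGE sql SECURITY DEFINER;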

DDL Commands
Here is a list of commands that are in Greenplum database but not in HAWQ (two of them are sketched after the list):

  • CREATE AGGREGATE: a user-defined aggregate function like SUM or COUNT.
  • CREATE CAST: a user-defined conversion between two data types.
  • CREATE CONVERSION: a user-defined conversion between character set encodings.
  • CREATE DOMAIN: a user-defined data type with optional constraints.
  • CREATE INDEX: indexes aren’t supported in HAWQ.
  • CREATE OPERATOR: a user-defined operator, such as defining != to behave the same as <>.
  • CREATE OPERATOR CLASS: a user-defined class describing how a data type is used within an index.
  • CREATE RULE: a user-defined rewrite rule on a table or view that transforms an incoming command into another, e.g. turning a DELETE into an UPDATE that flags rows like “gender_code in (‘M’, ‘F’)”.
  • CREATE TABLESPACE: a user-defined location on the POSIX filesystem for storing database objects.
  • CREATE TRIGGER: a user-defined trigger on a table. Note that trigger support is very limited even in Greenplum database.
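As a quick illustration, here are two of these commands as they would run in Greenplum database; both fail in HAWQ 1.2. The table, index, and domain names are hypothetical:

    -- Indexes aren't supported in HAWQ.
    CREATE INDEX idx_orders_customer ON orders (customer_id);

    -- A constrained data type, echoing the gender_code example above.
    CREATE DOMAIN gender_code AS char(1)
        CHECK (VALUE IN ('M', 'F'));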

DML Commands
HDFS is designed for “write once, read many” and cannot handle the in-place file changes that the DELETE and UPDATE commands require. Because of this, HAWQ doesn’t support UPDATE and DELETE while Greenplum database does.
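To make the difference concrete: the UPDATE below runs in Greenplum database, while in HAWQ a common workaround is to rewrite the table with CREATE TABLE AS. The customers table and its columns are hypothetical, and the rewrite is a general pattern rather than a HAWQ-specific feature:

    -- Greenplum database: works as expected.
    UPDATE customers
    SET status = 'inactive'
    WHERE last_order_date < '2013-01-01';

    -- HAWQ: no UPDATE, so rewrite the table instead.
    CREATE TABLE customers_fixed AS
    SELECT id, name, last_order_date,
           CASE WHEN last_order_date < '2013-01-01'
                THEN 'inactive' ELSE status END AS status
    FROM customers;
    -- Then drop the old table and rename customers_fixed into place.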

Summary
There are other small differences between the two products, but these are the major ones.

2 thoughts on “HAWQ versus Greenplum Database”

  1. Nguyen Anh Nguyen

    “…DML Commands
    HDFS is designed for “write once, read many” and cannot handle the in-place file changes that the DELETE and UPDATE commands require. Because of this, HAWQ doesn’t support UPDATE and DELETE while Greenplum database does….”

    Yet https://hdb.docs.pivotal.io/211/hawq/overview/HAWQOverview.html says Pivotal HDB / HAWQ is ACID-compliant. Does this mean HAWQ is not ACID-compliant, even with row orientation on HDFS?

    Confusing!

    1. Jon (post author)

      Pivotal HDB / Apache HAWQ is ACID compliant.
      Atomicity means all or nothing: HAWQ won’t insert some of your data and leave out the rest.
      Consistency means that a transaction brings the database from one valid state to another. HAWQ does this.
      Isolation means that concurrent transactions don’t interfere with each other. You can insert data into a table from two different sessions; one can roll back its transaction while the other commits, without an issue.
      Durability means that once data is committed, it stays committed.

      Nothing in ACID requires DELETE or UPDATE.
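
      A minimal sketch of that two-session scenario (t is a hypothetical one-column table):

        -- Session 1:
        BEGIN;
        INSERT INTO t VALUES (1);

        -- Session 2, concurrently:
        BEGIN;
        INSERT INTO t VALUES (2);
        COMMIT;

        -- Session 1 rolls back:
        ROLLBACK;

        -- t now contains only the row committed by session 2.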

