When should I use Greenplum Database versus HAWQ?

I get this question a lot, so I thought I would post the answer. Greenplum Database is a robust MPP database that works very well for Data Marts and Enterprise Data Warehouses, tackling historical Business Intelligence reporting as well as predictive analytical use cases. HAWQ provides the most robust SQL interface for Hadoop and can tackle data exploration and transformation in HDFS.

HAWQ
HAWQ is “HAdoop With Query” and is basically a port of Greenplum that stores its data natively in HDFS. This gives everyone using Hadoop a friendlier and more robust programming interface to the platform. Tools for Hadoop have evolved over time out of the need to get more consumers using Hadoop in a more productive manner.

The HAWQ architecture is very similar to Greenplum’s right now: each segment has its own files, but instead of storing those files locally on each segment (shared nothing), the data is stored in HDFS. Over time, I believe this architecture will change to enable greater elasticity in the number of segments processing data, but right now, Greenplum and HAWQ share this design.
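
As a rough illustration, both systems expose the segment layout through the same catalog view; the difference is simply where each segment’s files live. The query below is standard, though the output you see will of course depend on your cluster.

-- List the segments in the cluster; this catalog view exists in both
-- Greenplum Database and HAWQ. In Greenplum each segment keeps its files
-- on local disks; in HAWQ the segment data files live in HDFS.
SELECT dbid, content, role, hostname, port
FROM gp_segment_configuration
ORDER BY content;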

Hadoop Architecture
There is lots of material online about how Hadoop works, but I want to point out one of the principles of Hadoop and how it relates to having a relational database store its files there. The Simple Coherency Model is a “write-once-read-many access model for files” that HDFS uses to simplify the data model in Hadoop. Once you “put” a file into HDFS, you can’t edit it to change its contents. You can remove the file, but once you write it, that is it.

Database Features
There are many “small” features in Greenplum Database that either aren’t supported yet in HAWQ or never will be because HDFS won’t allow it. These features need to be taken into consideration when choosing which solution to deploy on. For example, here is a short list of features found in Greenplum Database but not in HAWQ: LOCK TABLE, CREATE INDEX, CREATE TYPE, CREATE RULE, CREATE AGGREGATE, and CREATE LANGUAGE.
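
To make that concrete, here is a quick sketch of DDL that runs fine on Greenplum Database but would be rejected by HAWQ. The table and index names are made up for illustration.

-- Runs on Greenplum Database; HAWQ does not support CREATE INDEX or LOCK TABLE.
CREATE TABLE sales_fact (
    sale_id   bigint,
    sale_date date,
    amount    numeric(12,2)
) DISTRIBUTED BY (sale_id);

CREATE INDEX idx_sales_date ON sales_fact (sale_date);   -- not available in HAWQ

BEGIN;
LOCK TABLE sales_fact IN EXCLUSIVE MODE;                 -- not available in HAWQ
-- ... maintenance work ...
COMMIT;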

Two more commonly used features of Greenplum that are not in HAWQ are the DELETE and UPDATE commands. When a database executes a DELETE or UPDATE, it is editing the file: it has to mark the old row as removed, which requires changing the file. This isn’t allowed in HDFS, so consequently, UPDATE and DELETE are not allowed in HAWQ.
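
Since UPDATE and DELETE are off the table in HAWQ, the usual pattern is to rewrite the rows you want to keep into a new table rather than edit rows in place. A minimal sketch, reusing the hypothetical sales_fact table from above:

-- Greenplum Database: in-place changes are fine.
UPDATE sales_fact SET amount = amount * 1.1 WHERE sale_date = '2013-01-01';
DELETE FROM sales_fact WHERE sale_date < '2012-01-01';

-- HAWQ: HDFS files are write-once, so rewrite instead of editing.
CREATE TABLE sales_fact_new AS
SELECT sale_id, sale_date, amount
FROM   sales_fact
WHERE  sale_date >= '2012-01-01'
DISTRIBUTED BY (sale_id);

DROP TABLE sales_fact;
ALTER TABLE sales_fact_new RENAME TO sales_fact;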

HAWQ Features
Now there are some features of HAWQ that don’t exist in Greenplum Database that are also compelling. PXF is the Pivotal Extension Framework, which provides enhanced capabilities when working with data in HDFS from HAWQ. Simply put, you can create an External Table over HDFS that reads Text, Hive, HBase, and soon Parquet. The connectors are optimized for these formats and can leverage partitioning and predicate push-down. Another feature is the ability to gather statistics on these external tables. So if you have a large amount invested in Hadoop already, HAWQ can leverage that investment and extend it with more capabilities.
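
For example, a PXF external table over delimited text in HDFS looks roughly like this. The host, port, path, column names, and the PROFILE value are placeholders; check the PXF documentation for the exact LOCATION syntax in your release.

-- A sketch of a PXF readable external table in HAWQ; the pxf:// URL
-- details and object names are assumptions for illustration.
CREATE EXTERNAL TABLE ext_clickstream (
    user_id   bigint,
    url       text,
    ts        timestamp
)
LOCATION ('pxf://namenode:51200/data/clickstream?PROFILE=HdfsTextSimple')
FORMAT 'TEXT' (DELIMITER E'\t');

-- Statistics can be gathered on the external table to help the planner.
ANALYZE ext_clickstream;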

How does it all fit together?
A lot of companies are embracing Hadoop because it is an inexpensive, commodity-hardware-based, scalable solution that can handle terabytes to petabytes of data. HAWQ makes Hadoop better by providing a robust SQL interface built on more than 10 years of development of the Greenplum database.

Companies are building their “data lakes” and using Pig, Hive, HBase, and now SQL to explore and transform their data. This is where HAWQ excels.

Once data is transformed, companies load the relational data into Data Marts or Enterprise Data Warehouses for consumption. This is where Greenplum Database comes in: a robust MPP database that provides historical Business Intelligence capabilities and forward-looking predictive analytics, all in-database with SQL.
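
A hedged sketch of that hand-off: land the transformed files into a Greenplum fact table and run BI-style SQL on it. The gpfdist host, file paths, and object names are made up for illustration.

-- Read the transformed files via a gpfdist external table, then load
-- them into a distributed Greenplum table for reporting.
CREATE EXTERNAL TABLE stg_sales (
    sale_id   bigint,
    sale_date date,
    amount    numeric(12,2)
)
LOCATION ('gpfdist://etl-host:8080/sales/*.csv')
FORMAT 'CSV';

INSERT INTO sales_fact
SELECT sale_id, sale_date, amount
FROM   stg_sales;

-- Historical BI reporting plus in-database analytics, all in SQL:
-- daily totals with a 7-day moving average.
SELECT sale_date,
       sum(amount) AS daily_total,
       avg(sum(amount)) OVER (ORDER BY sale_date
                              ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS seven_day_avg
FROM   sales_fact
GROUP  BY sale_date
ORDER  BY sale_date;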

[Figure: Architecture]

4 thoughts on “When should I use Greenplum Database versus HAWQ?”

  1. Frank Lee

    Thanks for the detailed explanation.

    Would you be able to confirm that, except for HAWQ external tables, the underlying data of a standard HAWQ table is stored in a pre-processed, optimized, native format just like any typical traditional relational database, in other words schema-on-write? And that there is no way for other Hadoop components like Hive or Spark to read and consume the HAWQ table data / metadata?

    I have no problem with that, as I can imagine that a schema-on-write approach can boost DB query performance. I ask this question because I have read some other articles which claim that HAWQ is schema-on-read and that “MapReduce jobs can also access HAWQ data”. I would like to know whether these statements only apply to the PXF feature you mentioned.

    1. Jon (post author)

      HAWQ can store data in its native format in HDFS or in Parquet format, but that doesn’t mean the data isn’t available to MapReduce. Other components can read the native or Parquet formats. Here is the link to the documentation that explains how to read the data from other tools: http://hawq.docs.pivotal.io/docs-hawq/topics/HAWQInputFormatforMapReduce.html

      I also have a blog post on reading a HAWQ table stored in Parquet format from Pig: http://www.pivotalguru.com/?p=727

