HAWQ with Parquet Files

Parquet is a column-oriented storage format for data in HDFS. It is part of the Apache ecosystem and is available in Pivotal HD and HAWQ. HAWQ can store and read data in the Parquet format, and the same files are readable by the open source components of Pivotal HD such as Pig and MapReduce.

Here is a quick example showing how this works.

First, create a table with some data. You can either use the CTAS method or the more traditional CREATE TABLE followed by INSERT. Either method works; which you use is a matter of preference. This example generates just 100 records of fake customer data.


-- Option 1: CREATE TABLE AS SELECT (CTAS)
CREATE TABLE customer
WITH (appendonly=true, orientation=parquet) AS
SELECT i AS id, 'jon' || i AS fname, 'roberts' || i AS lname, i::text || ' main street'::text AS address, 'new york'::text AS city, 'ny'::text AS state, lpad(i, 5, '0') AS zip
FROM (SELECT generate_series(1, 100) AS i) AS sub;


-- Option 2: CREATE TABLE, then INSERT
CREATE TABLE customer
(
  id integer,
  fname text,
  lname text,
  address text,
  city text,
  state text,
  zip text
)
WITH (appendonly=true, orientation=parquet);

INSERT INTO customer
SELECT i AS id, 'jon' || i AS fname, 'roberts' || i AS lname, i::text || ' main street'::text AS address, 'new york'::text AS city, 'ny'::text AS state, lpad(i, 5, '0') AS zip
FROM (SELECT generate_series(1, 100) AS i) AS sub;
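To make it clear what rows the SELECT above produces, here is a minimal Python sketch (not part of HAWQ, purely illustrative) that mirrors the same logic. Note that SQL's `lpad(i, 5, '0')` corresponds to `zfill(5)` in Python.

```python
# Mirror the SQL row generator: 100 fake customer records.
def make_customer(i):
    # lpad(i, 5, '0') in the SQL is str(i).zfill(5) here.
    return (i, f"jon{i}", f"roberts{i}", f"{i} main street",
            "new york", "ny", str(i).zfill(5))

rows = [make_customer(i) for i in range(1, 101)]
print(rows[0])    # (1, 'jon1', 'roberts1', '1 main street', 'new york', 'ny', '00001')
print(len(rows))  # 100
```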

Now you have data in the Parquet format in HAWQ. Pretty easy, huh?

Next, I’ll use a nifty tool, gpextract, that queries the HAWQ catalog and tells me where the Parquet files are stored.

gpextract -o customer.yaml -W customer -d gpadmin

And here is the customer.yaml file it created.

DBVersion: PostgreSQL 8.2.15 (Greenplum Database 4.2.0 build 1) (HAWQ build
  8119) on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.4.2 compiled on Apr
  23 2014 16:12:32
DFS_URL: hdfs://phd1.pivotalguru.com:8020
Encoding: UTF8
FileFormat: Parquet
Parquet_FileLocations:
  Checksum: false
  CompressionLevel: 0
  CompressionType: null
  EnableDictionary: false
  Files:
  - path: /hawq_data/gpseg0/16385/16554/16622.0
    size: 4493
  - path: /hawq_data/gpseg1/16385/16554/16622.0
    size: 4499
  PageSize: 1048576
  RowGroupSize: 8388608
TableName: public.customer
Version: 1.0.0
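If you want to script against this output rather than read it by eye, the file paths and sizes are easy to pull out. Here is a minimal sketch that does it with plain string handling (an assumption on my part, to avoid a PyYAML dependency); the embedded `sample` is a trimmed copy of the extract above.

```python
# Pull Parquet file paths and sizes out of gpextract-style YAML output.
sample = """\
FileFormat: Parquet
Parquet_FileLocations:
  Files:
  - path: /hawq_data/gpseg0/16385/16554/16622.0
    size: 4493
  - path: /hawq_data/gpseg1/16385/16554/16622.0
    size: 4499
"""

def extract_files(yaml_text):
    files, current = [], None
    for line in yaml_text.splitlines():
        line = line.strip()
        if line.startswith("- path:"):
            # Start a new file entry.
            current = {"path": line.split(":", 1)[1].strip()}
            files.append(current)
        elif line.startswith("size:") and current is not None:
            current["size"] = int(line.split(":", 1)[1].strip())
    return files

for f in extract_files(sample):
    print(f["path"], f["size"])
```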

Notice the paths to the files, which are stored in HDFS in the Parquet format.

Now you can use a tool like Pig to look at the data.

grunt> A = load '/hawq_data/gpseg{0,1}/16385/16554/16622' USING parquet.pig.ParquetLoader();
grunt> describe A;                                                                          
A: {id: int,fname: bytearray,lname: bytearray,address: bytearray,city: bytearray,state: bytearray,zip: bytearray}
grunt> B = foreach A generate id, fname, lname, address, city, state, zip;
grunt> dump B;
(2,jon2,roberts2,2 main street,new york,ny,00002)
(4,jon4,roberts4,4 main street,new york,ny,00004)
(6,jon6,roberts6,6 main street,new york,ny,00006)
(8,jon8,roberts8,8 main street,new york,ny,00008)

Parquet is easy to use in HAWQ and doesn’t lock you into a Pivotal HD and HAWQ only solution. Other tools like Pig or MapReduce can easily read the Parquet files in your Hadoop cluster. No vendor lock-in.

2 thoughts on “HAWQ with Parquet Files”

    1. Jon Post author

      It depends.

      HAWQ, like Greenplum Database, implements columnar storage with a file per column. This is great for reducing I/O because queries read smaller files. However, Hadoop doesn’t do well with lots and lots of files, so a column-oriented table may actually hurt performance in HAWQ.

      For example, let’s say you have a partitioned and column-oriented table in HAWQ with 37 partitions, 100 columns, and 96 segments. With this scenario, you have roughly 96 * 100 * 37 = 355,200 files for this one table. With Parquet, you have 96 * 37 = 3,552 files, which is far more manageable.
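The file-count arithmetic in that example is worth seeing side by side: a column-oriented table writes one file per column per segment per partition, while a Parquet table writes one file per segment per partition.

```python
# File counts for the scenario in the comment above.
segments, columns, partitions = 96, 100, 37

column_oriented_files = segments * columns * partitions  # one file per column
parquet_files = segments * partitions                    # one file per row group set

print(column_oriented_files)  # 355200
print(parquet_files)          # 3552
```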

      So one table may perform better as Parquet, while a different table may perform better with column orientation; it depends on the table.

