Monthly Archives: May 2015

HAWQ versus Hive on HDP 2.2.4.2 (Part 2)

Part 1 compared HAWQ with Hive running on Map-Reduce instead of Tez. Now in Part 2, I execute the same scripts with Tez enabled.

Enabling Tez is pretty easy to do with HDP 2.2.4.2. You simply change this setting in the General section under Hive and then restart Hive:

hive.execution.engine = tez
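
If you just want to experiment with Tez for a single session instead of changing it cluster-wide, the same property can also be set from the Hive CLI (a standard Hive setting, shown here only as a quick sketch):

-- Set the execution engine for the current Hive session only
set hive.execution.engine=tez;

-- Switch back to Map-Reduce for comparison runs
set hive.execution.engine=mr;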

Next, I re-ran the same loads and queries. Overall, Hive was roughly 25% faster with Tez enabled, but it is still much slower than HAWQ.

Here is the graph of results:
HAWQ vs Hive 2

Observations
– Loading was slower with Tez enabled but this is probably because I’m testing with a VM.
– Every query was about 40% faster in Hive with Tez versus without.
– HAWQ was about 30 times faster than Hive with Tez and about 57 times faster than Hive with Map-Reduce.
– The execution time reported by Hive understated the true wall-clock time for each query, presumably because it does not count client startup and job submission overhead. For example, I used “time hive -f icd9.sql” to execute the icd9.sql query and capture the timings. The time command reported:

real    0m25.060s

but Hive reported:

Time taken: 13.17 seconds, Fetched: 19 row(s)

So, another test but the same result: Hive is much, much slower than HAWQ, even with Tez enabled. If you want enterprise-level SQL for Hadoop, HAWQ is the best solution.

HAWQ versus Hive on HDP 2.2.4.2 (Part 1)

I recently built a Virtual Machine running CentOS 6.4, the Hortonworks installation of Ambari 1.7, Hortonworks HDP 2.2.4.2, and Pivotal HAWQ 1.3.0.1. If you aren’t already familiar with the Open Data Platform, it is the shared industry effort to standardize the core components of Hadoop (HDFS, YARN, MapReduce, and Ambari) that both Hortonworks and Pivotal are part of. This means I have HAWQ running on the Hortonworks deployment, and this is a supported configuration.

After I got this built, I decided to put a set of demo scripts I use for HAWQ in the VM. It contains a simple Star Schema of Centers for Medicare and Medicaid Services (CMS) data: 5 Dimensions and a single claims Fact with 1 million claims.
CMS Star Schema

The demo does the following:
1. Loads the claims Fact with 1 million claims
2. Creates and loads the 5 Dimensions
3. Creates and loads a single “Sandbox” table that joins all Dimensions to build a 1NF table
4. Executes 3 basic ad-hoc queries (a sketch of one is shown below)
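
To give a sense of the workload, here is a rough sketch of the kind of ad-hoc query the demo runs; the table and column names below are illustrative, not the actual demo script:

-- Hypothetical ad-hoc query against the star schema (illustrative names only):
-- total claims and claim dollars by gender.
select g.gender,
       count(*) as claim_count,
       sum(f.claim_amount) as total_claim_amount
from claims_fact f
join gender_dim g on f.gender_id = g.gender_id
group by g.gender
order by total_claim_amount desc;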

I then converted my scripts to do the same with Hive, but I ran into one problem: CMS provides one table with ranges of values rather than each distinct value. For example:

001|139|Infectious and parasitic diseases
140|239|Neoplasms
240|279|Endocrine, nutritional and metabolic diseases, and immunity disorders
...

With HAWQ, I used generate_series(int, int) in a simple SQL query to generate every value in each range. Unfortunately, Hive doesn’t support this function, so I had two options: write my own UDF or borrow the transformed data from HAWQ. I chose the latter because I could get it done faster.
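
As a sketch of the HAWQ side, a query along these lines expands each range into individual codes (generate_series(int, int) is a standard PostgreSQL/HAWQ function; the table and column names are illustrative):

-- Expand each CMS range (e.g. 001|139) into one row per ICD9 code.
-- icd9_ranges, low_code, high_code, and description are illustrative names;
-- leading zeros would need to be re-added (e.g. with lpad) for the real codes.
select generate_series(low_code::int, high_code::int) as icd9_code,
       description
from icd9_ranges;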

And now, the results!

HAWQ Timings
Load CMS Table: 3.1 seconds
Dimensions: 1.6 seconds
Sandbox Table: 3.1 seconds
Gender Query: 0.9 seconds
Location Query: 0.72 seconds
ICD9 Query: 0.95 seconds
Total: 10.37 seconds

Hive Timings
Load CMS Table: 44.5 seconds
Dimensions (4 of the 5): 3 minutes 42.6 seconds
Dimension Fix: 35.7 seconds
Sandbox Table: 45.6 seconds
Gender Query: 47.5 seconds
Location Query: 48.4 seconds
ICD9 Query: 48.8 seconds
Total: 8 minutes 13.2 seconds

This graph is very telling.
HAWQ vs Hive

Hive looks to be a good entry-level tool for SQL on Hadoop. The performance isn’t great, but it comes with every distribution of Hadoop and is easier to use than other Hadoop tools. HAWQ, however, gives you significantly better performance, and it can run on either the Pivotal HD or the Hortonworks distribution. Both are based on the ODP, which delivers the standard Hadoop core, so HAWQ works on either one.

In my next post, I plan on enabling Tez for Hive to see what the performance benefit will be. It should help, but I don’t think it will come close to the performance of HAWQ. We shall see.

XML Loading Again

I covered this topic here before with a solution that handles very large XML files in Hadoop. This blog post covers parsing XML as it is loaded into HAWQ or Greenplum Database.

sample.xml

<?xml version="1.0"?>
<catalog>
      <large-product>
         <name>foo1</name>
         <price>110</price>
      </large-product>
      <large-product>
         <name>foo2</name>
         <price>120</price>
      </large-product>
      <large-product>
         <name>foo3</name>
         <price>130</price>
      </large-product>
      <large-product>
         <name>foo4</name>
         <price>140</price>
      </large-product>
      <large-product>
         <name>foo5</name>
         <price>150</price>
      </large-product>
      <small-product>
         <name>bar1</name>
         <price>10</price>
      </small-product>
      <small-product>
         <name>bar2</name>
         <price>20</price>
      </small-product>
      <small-product>
         <name>bar3</name>
         <price>30</price>
      </small-product>      
      <small-product>
         <name>bar4</name>
         <price>40</price>
      </small-product>
      <small-product>
         <name>bar5</name>
         <price>50</price>
      </small-product>
</catalog>

sample.xsl

<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes"/>
<xsl:template match="catalog"><xsl:for-each select="large-product">
Large|<xsl:value-of select="name"/>|<xsl:value-of select="price"/>
</xsl:for-each>
<xsl:for-each select="small-product">
Small|<xsl:value-of select="name"/>|<xsl:value-of select="price"/>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>

sample.yml

---
VERSION: 1.0.0.1
TRANSFORMATIONS:
  sample:
     TYPE: input
     CONTENT: data
     COMMAND: /usr/bin/xsltproc sample.xsl %filename%

Start gpfdist:

gpfdist -c sample.yml -p 8080 >> sample.log 2>&1 < sample.log &

Create External Table

create external table sample
(product_type text, product_name text, product_price int)
location ('gpfdist://bigmac:8080/sample.xml#transform=sample')
format 'text' (delimiter '|' header);

Select the data.

gpdb=# select * from sample; 
 product_type | product_name | product_price 
--------------+--------------+---------------
 Large        | foo1         |           110
 Large        | foo2         |           120
 Large        | foo3         |           130
 Large        | foo4         |           140
 Large        | foo5         |           150
 Small        | bar1         |            10
 Small        | bar2         |            20
 Small        | bar3         |            30
 Small        | bar4         |            40
 Small        | bar5         |            50
(10 rows)

This solution works great for parsing reasonably sized XML files into a relational format but if you have very large files, use Hadoop and review my other blog post here.

New Project: gplink

I created a new project that simplifies the process of creating Greenplum or HAWQ External Tables that use gpfdist to stream data from any valid JDBC source. It is like pointing gpfdist at Sqoop to pull data without landing a file, but gplink also ensures that the data is cleansed first so that it will be readable by Greenplum or HAWQ.

This will work with PostgreSQL (yeah!), MySQL (yuck), DB2, Informix, etc. You will have to download all third-party JDBC drivers separately.
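
On the database side, the result is an ordinary gpfdist External Table. The sketch below only illustrates the general shape; the host, port, and table definition are made up and are not what gplink actually generates:

-- Rough sketch of a gpfdist External Table over data streamed from a JDBC
-- source. Host, port, and column names are illustrative only.
create external table ext_orders
(order_id int, customer_id int, order_date date, amount numeric)
location ('gpfdist://etlhost:8000/orders_from_jdbc')
format 'text' (delimiter '|');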

gplink v0.0.2
Source Code

This is a new project so I’m looking forward to any feedback you can provide.

Outsourcer 5.0.5

I have made improvements around starting and stopping gpfdist processes. Outsourcer now handles the situation where a gpfdist process doesn’t start properly: instead of failing, it picks the next available port. I also completely separated osstart and uistart; the User Interface does not need to be running in order for Outsourcer to load tables in the Queue or select from Custom SQL Tables.

This version is a release candidate, as testing has shown it to be very stable.

Download
Documentation
Source Code