Category Archives: Hadoop

HAWQ versus Hive on HDP 2.2.4.2 (Part 2)

Part 1 compared HAWQ with Hive running on Map-Reduce rather than Tez. Now in Part 2, I will execute the same scripts but with Tez enabled.

Enabling Tez is pretty easy to do with HDP 2.2.4.2. You simply change this setting in the General section of the Hive configuration and restart Hive:

hive.execution.engine = tez
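
If you just want to try Tez without changing the cluster-wide default, Hive also lets you set the engine for a single session. A quick illustration, run from the Hive CLI:

-- Use Tez for this session only
SET hive.execution.engine=tez;

-- Switch back to Map-Reduce to compare timings
SET hive.execution.engine=mr;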

Next, I re-ran the same loads and queries. Overall, Hive was roughly 25% faster with Tez enabled, but it is still much slower than HAWQ.

Here is the graph of results:
HAWQ vs Hive 2

Observations
– Loading was slower with Tez enabled, but this is probably because I’m testing in a VM.
– Every query was about 40% faster in Hive with Tez than without it.
– HAWQ was about 30 times faster than Hive with Tez and about 57 times faster than Hive with Map-Reduce.
– The execution time reported by Hive was lower than the actual wall-clock time for each query. For example, I used “time hive -f icd9.sql” to execute the icd9.sql query and capture the timings. Time reported:

real    0m25.060s

but Hive reported:

Time taken: 13.17 seconds, Fetched: 19 row(s)

So, another test but the same result: even with Tez enabled, Hive is much, much slower than HAWQ. If you want enterprise-level SQL on Hadoop, HAWQ is the best solution.

HAWQ versus Hive on HDP 2.2.4.2 (Part 1)

I recently built a Virtual Machine running CentOS 6.4, the Hortonworks installation of Ambari 1.7, Hortonworks HDP 2.2.4.2, and Pivotal HAWQ 1.3.0.1. If you aren’t already familiar with the Open Data Platform, it is the shared industry effort to standardize the core components of Hadoop (HDFS, YARN, MapReduce, and Ambari) that both Hortonworks and Pivotal are part of. This means I have HAWQ running on the Hortonworks deployment, and this is a supported configuration.

After I got this built, I decided to put a set of demo scripts I use for HAWQ into the VM. The demo uses a simple Star Schema of Center for Medicare and Medicaid Services (CMS) data: 5 Dimensions and a single claims Fact with 1 million claims.
CMS Star Schema

The demo does the following:
1. Loads the claims Fact with 1 million claims
2. Creates and loads the 5 Dimensions
3. Creates and loads a single “Sandbox” table that joins all Dimensions to build a 1NF table
4. Executes 3 basic ad-hoc queries (a simplified sketch of steps 3 and 4 follows this list)
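
To give a sense of steps 3 and 4, here is a minimal sketch of the Sandbox build and one ad-hoc query in HAWQ. The table and column names are illustrative only and are not the demo’s actual schema:

-- Hypothetical schema: join the Fact to a few of the Dimensions to build a 1NF table
CREATE TABLE sandbox AS
SELECT f.claim_id,
       f.claim_amount,
       g.gender_description,
       l.location_description,
       i.icd9_description
FROM   claims_fact f
       JOIN gender_dim   g ON f.gender_id   = g.gender_id
       JOIN location_dim l ON f.location_id = l.location_id
       JOIN icd9_dim     i ON f.icd9_id     = i.icd9_id
DISTRIBUTED BY (claim_id);

-- One of the basic ad-hoc queries against the Sandbox table
SELECT gender_description, count(*)
FROM sandbox
GROUP BY gender_description;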

I then converted my scripts to do the same with Hive, but I ran into one problem. CMS provides one table with a range of values rather than providing each distinct value. For example:

001|139|Infectious and parasitic diseases
140|239|Neoplasms
240|279|Endocrine, nutritional and metabolic diseases, and immunity disorders
...

With HAWQ, I used generate_series(int, int) in a simple SQL query to generate every value in each range. Unfortunately, Hive doesn’t support this function, so I had two options: write my own UDF or borrow the transformed data from HAWQ. I chose the latter because I could get it done faster.
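
For illustration, here is roughly what the HAWQ expansion looks like. The staging table and column names are hypothetical, not the demo’s actual ones:

-- Hypothetical staging table holding one row per ICD9 range
CREATE TABLE icd9_ranges (low int, high int, description text);

-- generate_series() expands each range into one row per distinct code
CREATE TABLE icd9_codes AS
SELECT generate_series(low, high) AS code,
       description
FROM icd9_ranges;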

And now, the results!

HAWQ Timings
Load CMS Table: 3.1 seconds
Dimensions: 1.6 seconds
Sandbox Table: 3.1 seconds
Gender Query: 0.9 seconds
Location Query: 0.72 seconds
ICD9 Query: 0.95 seconds
Total: 10.37 seconds

Hive Timings
Load CMS Table: 44.5 seconds
Dimensions (4 of the 5): 3 minutes 42.6 seconds
Dimension Fix: 35.7 seconds
Sandbox Table: 45.6 seconds
Gender Query: 47.5 seconds
Location Query: 48.4 seconds
ICD9 Query: 48.8 seconds
Total: 8 minutes 13.2 seconds

This graph is very telling.
HAWQ vs Hive

Hive looks to be a good entry-level tool for SQL on Hadoop. The performance isn’t great, but it comes with every distribution of Hadoop and is easier to use than other Hadoop tools. HAWQ, however, gives you significantly better performance, and it runs on either the Pivotal HD or the Hortonworks distribution. Both are based on the ODP core, so HAWQ works the same on either one.

In my next post, I plan on enabling Tez for Hive to see what the performance benefit will be. It should help, but I don’t think it will come close to the performance of HAWQ. We shall see.

XML Loading Again

I covered this topic here before with a solution that handles very large XML files in Hadoop. This blog post covers parsing XML as it is loaded into HAWQ or Greenplum Database.

sample.xml

<?xml version="1.0"?>
<catalog>
      <large-product>
         <name>foo1</name>
         <price>110</price>
      </large-product>
      <large-product>
         <name>foo2</name>
         <price>120</price>
      </large-product>
      <large-product>
         <name>foo3</name>
         <price>130</price>
      </large-product>
      <large-product>
         <name>foo4</name>
         <price>140</price>
      </large-product>
      <large-product>
         <name>foo5</name>
         <price>150</price>
      </large-product>
      <small-product>
         <name>bar1</name>
         <price>10</price>
      </small-product>
      <small-product>
         <name>bar2</name>
         <price>20</price>
      </small-product>
      <small-product>
         <name>bar3</name>
         <price>30</price>
      </small-product>      
      <small-product>
         <name>bar4</name>
         <price>40</price>
      </small-product>
      <small-product>
         <name>bar5</name>
         <price>50</price>
      </small-product>
</catalog>

sample.xsl

<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes"/>
<xsl:template match="catalog"><xsl:for-each select="large-product">
Large|<xsl:value-of select="name"/>|<xsl:value-of select="price"/>
</xsl:for-each>
<xsl:for-each select="small-product">
Small|<xsl:value-of select="name"/>|<xsl:value-of select="price"/>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
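
For reference, applying sample.xsl to sample.xml should produce pipe-delimited rows roughly like the ones below. The stylesheet emits a leading blank line, which is why the External Table below uses the header option to skip the first line:

Large|foo1|110
Large|foo2|120
...
Small|bar4|40
Small|bar5|50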

sample.yml

---
VERSION: 1.0.0.1
TRANSFORMATIONS:
  sample:
     TYPE: input
     CONTENT: data
     COMMAND: /usr/bin/xsltproc sample.xsl %filename%

Start gpfdist:

gpfdist -c sample.yml -p 8080 >> sample.log 2>&1 &

Create External Table

create external table sample
(product_type text, product_name text, product_price int)
location ('gpfdist://bigmac:8080/sample.xml#transform=sample')
format 'text' (delimiter '|' header);

Select the data.

gpdb=# select * from sample; 
product_type | product_name | product_price 
--------------+--------------+---------------
 Large        | foo1         |           110
 Large        | foo2         |           120
 Large        | foo3         |           130
 Large        | foo4         |           140
 Large        | foo5         |           150
 Small        | bar1         |            10
 Small        | bar2         |            20
 Small        | bar3         |            30
 Small        | bar4         |            40
 Small        | bar5         |            50
(10 rows)
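
Keep in mind that each scan of the External Table runs the transform again, so if you plan to query the parsed data repeatedly, it is worth landing it in a regular table first. A minimal sketch (the target table name is just an example):

-- Materialize the parsed XML so the transform only runs once
CREATE TABLE products AS
SELECT product_type, product_name, product_price
FROM sample
DISTRIBUTED BY (product_name);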

This solution works great for parsing reasonably sized XML files into a relational format, but if you have very large files, use Hadoop and review my other blog post here.

External Table “TRANSFORM” Option

In the Greenplum Administrator Guide, there is a section that covers loading XML data with gpload. It also mentions that you can create the External Table yourself. This demo will show you how to create an External Table that utilizes this feature so that you can execute any script you want on your ETL server. This demo also works for HAWQ.

XML parsing is done sequentially from the top to the bottom of a file, and the TRANSFORM option was built with this in mind. The gpfdist process executes a script for you, and its output is read by the segments in parallel. Instead of parsing XML with a script, this demo executes a Unix command in a script to show how you can leverage this feature to run virtually any command you want on a remote server.

ETL Server
/data1/demo5/demo5.yml

---
VERSION: 1.0.0.1
TRANSFORMATIONS:
  transform_demo5:
     TYPE: input 
     CONTENT: data
     COMMAND: /bin/bash get_df.sh

As you can see, I created a transform named “transform_demo5” that executes the script “get_df.sh”. So let’s look at get_df.sh.
/data1/demo5/get_df.sh

df -k | awk '{print $1"|"$2"|"$3"|"$4"|"$5"|"$6}' | tail -n +2

This simple script runs the Unix df command, converts its output to pipe-delimited text, and drops the header line with tail. Executing the command outputs this:

[pivhdsne:demo5]$ df -k | awk '{print $1"|"$2"|"$3"|"$4"|"$5"|"$6}' | tail -n +2
/dev/sda3|47472560|12080036|32981056|27%|/
tmpfs|4030752|0|4030752|0%|/dev/shm
/dev/sda1|99150|48764|45266|52%|/boot

Now start a gpfdist process that uses a configuration file. The “-c” option isn’t well documented for gpfdist, but it is mentioned in the XML loading section.

gpfdist -p 8999 -c /data1/demo5/demo5.yml > demo5.log 2>&1 &

Now create an External Table. Notice the #transform=transform_demo5 suffix in the LOCATION. The filename “foo” is ignored here, but you can reference the filename as a parameter in your transform scripts.

/data1/demo5/get_df.sql

CREATE EXTERNAL TABLE get_df 
(Filesystem text,
 K_blocks int,
 Used int,
 Available int,
 Used_percentage text,
 Mounted_on text)
LOCATION ('gpfdist://pivhdsne:8999/foo#transform=transform_demo5')
FORMAT 'TEXT' (DELIMITER '|');

Now create the External Table in your database (HAWQ or Greenplum).

psql -f get_df.sql -h gpdbvm43 
Password: 
Timing is on.
CREATE EXTERNAL TABLE
Time: 238.788 ms

Now let’s see the output.

[pivhdsne:demo5]$ psql -h gpdbvm43
Password: 
Timing is on.
psql (8.2.15)
Type "help" for help.

gpadmin=# select * from get_df;
 filesystem | k_blocks |   used   | available | used_percentage | mounted_on 
------------+----------+----------+-----------+-----------------+------------
 /dev/sda3  | 47472560 | 12104924 |  32956168 | 27%             | /
 tmpfs      |  4030752 |        0 |   4030752 | 0%              | /dev/shm
 /dev/sda1  |    99150 |    48764 |     45266 | 52%             | /boot
(3 rows)

Time: 32.149 ms
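
Since the External Table behaves like any other table, you can query the live df output directly. For example, a quick check for nearly full filesystems (the 80% threshold is arbitrary):

-- Strip the trailing % so usage can be compared numerically
SELECT filesystem, mounted_on, used_percentage
FROM get_df
WHERE replace(used_percentage, '%', '')::int > 80;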

This is yet another feature of Greenplum and HAWQ that gives you more flexibility in working with data. I’ve seen this feature used to move files, change permissions, get data from running programs, and of course parse XML. Enjoy!

Open Data Platform

Pivotal has teamed up with Hortonworks for the Open Data Platform. Pivotal is also making its data products (GemFire, Greenplum Database, and HAWQ) open source.

http://blog.pivotal.io/big-data-pivotal/news-2/pivotal-big-data-suite-open-agile-cloud-ready

http://hortonworks.com/blog/pivotal-hortonworks-announce-alliance/

http://blog.pivotal.io/big-data-pivotal/news-2/why-the-open-data-platform-is-such-a-big-deal-for-big-data