I recently built a Virtual Machine running CentOS 6.4, Hortonworks installation of Ambari 1.7, Hortonworks 220.127.116.11, and Pivotal HAWQ 18.104.22.168. If you aren’t already familiar with with the Open Data Platform, it is the shared industry effort to standardize the core components of Hadoop (HDFS, YARN, MapReduce, and Ambari) that both Hortonworks and Pivotal are a part of. This means I have HAWQ running on the Hortonworks deployment and this is a supported configuration.
After I got this built, I decided to put a set of demo scripts I use for HAWQ in the VM. It contains a simple Star Schema of Center for Medicare and Medicaid Services (CMS) data. It consists of 5 Dimensions and a single claims Fact with 1 million claims.
The demo does the following:
1. Loads the claims Fact with 1 million claims
2. Create and loads the 5 Dimensions
3. Creates and loads a single “Sandbox” table that joins all Dimensions to build a 1NF table
4. Execute 3 basic ad-hoc queries
I then converted my scripts to do the same with Hive but I ran into one problem. CMS provides one table with a range of values rather than providing each distinct value. For example:
001|139|Infectious and parasitic diseases 140|239|Neoplasms 240|279|Endocrine, nutritional and metabolic diseases, and immunity disorders ...
With HAWQ, I used generate_series(int, int) in a simple SQL query to generate every value in the range. Unfortunately, Hive doesn’t support this function so I had two options; write my own UDF or borrow the transformed data from HAWQ. I chose the latter because I could get this done faster.
And now, the results!
Load CMS Table: 3.1 seconds
Dimensions: 1.6 seconds
Sandbox Table: 3.1 seconds
Gender Query: 0.9 second
Location Query: 0.72 second
ICD9 Query: 0.95 second
Total: 10.37 seconds
Load CMS Table: 44.5 seconds
Dimensions (4 of the 5): 3 minutes 42.6 seconds
Dimension Fix: 35.7 seconds
Sandbox Table: 45.6 seconds
Gender Query: 47.5 seconds
Location Query: 48.4 seconds
ICD9 Query: 48.8 seconds
Total: 8 minutes 13.2 seconds
Hive looks to be a good entry level tool for SQL on Hadoop. The performance isn’t great but it comes with every distribution of Hadoop and is easier to use than other Hadoop tools. But HAWQ will give you significantly better performance and this can be on either the Pivotal HD or the Hortonworks distributions. Both distributions are based on the ODP which delivers the standard Hadoop core so that HAWQ will work on either distribution.
In my next post, I plan on enabling TEZ for Hive to see what the performance benefit will be. It should help but I don’t think it will come close to the performance of HAWQ. We shall see.