Here is a video demonstrating the comparison of HAWQ and Hive detailed in this blog post.
Here is the GitHub Repository with all of the scripts and data used in the demo. Look in the cms_demo directory.
And last but not least, the video:
I recently built a Virtual Machine running CentOS 6.4, the Hortonworks installation of Ambari 1.7, Hortonworks HDP, and Pivotal HAWQ. If you aren’t already familiar with the Open Data Platform, it is the shared industry effort to standardize the core components of Hadoop (HDFS, YARN, MapReduce, and Ambari) that both Hortonworks and Pivotal are a part of. This means I have HAWQ running on the Hortonworks deployment, and this is a supported configuration.
After I got this built, I decided to put a set of demo scripts I use for HAWQ in the VM. It contains a simple Star Schema of Center for Medicare and Medicaid Services (CMS) data. It consists of 5 Dimensions and a single claims Fact with 1 million claims.
The demo does the following:
1. Loads the claims Fact with 1 million claims
2. Creates and loads the 5 Dimensions
3. Creates and loads a single “Sandbox” table that joins all Dimensions to build a 1NF table
4. Executes 3 basic ad-hoc queries
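The “Sandbox” step boils down to a CREATE TABLE AS statement that joins the Fact to each Dimension. A minimal sketch is below; the table and column names here are hypothetical, and the real schema lives in the cms_demo scripts in the GitHub repository.

```sql
-- Hypothetical sketch of the "Sandbox" 1NF table build.
-- Table and column names are illustrative only; see the
-- cms_demo directory for the actual demo schema.
CREATE TABLE sandbox_claims AS
SELECT c.claim_id,
       c.claim_amount,
       g.gender_desc,
       l.state_name,
       i.icd9_desc
FROM   claims_fact  c
JOIN   gender_dim   g ON c.gender_id   = g.gender_id
JOIN   location_dim l ON c.location_id = l.location_id
JOIN   icd9_dim     i ON c.icd9_id     = i.icd9_id
DISTRIBUTED BY (claim_id);
```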
I then converted my scripts to do the same with Hive but I ran into one problem. CMS provides one table with a range of values rather than providing each distinct value. For example:
001|139|Infectious and parasitic diseases
140|239|Neoplasms
240|279|Endocrine, nutritional and metabolic diseases, and immunity disorders
...
With HAWQ, I used generate_series(int, int) in a simple SQL query to generate every value in the range. Unfortunately, Hive doesn’t support this function, so I had two options: write my own UDF or borrow the transformed data from HAWQ. I chose the latter because I could get it done faster.
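In HAWQ, expanding each low/high range into distinct rows looks roughly like this. This is a sketch assuming a staging table named icd9_ranges; the actual table and column names are in the repository scripts.

```sql
-- Sketch: expand each range into one row per code using generate_series.
-- icd9_ranges(low, high, description) is a hypothetical staging table
-- holding the raw CMS range data shown above.
CREATE TABLE icd9_dim AS
SELECT generate_series(low, high) AS icd9_code,
       description
FROM   icd9_ranges
DISTRIBUTED BY (icd9_code);
```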
And now, the results!
HAWQ:
Load CMS Table: 3.1 seconds
Dimensions: 1.6 seconds
Sandbox Table: 3.1 seconds
Gender Query: 0.9 seconds
Location Query: 0.72 seconds
ICD9 Query: 0.95 seconds
Total: 10.37 seconds
Hive:
Load CMS Table: 44.5 seconds
Dimensions (4 of the 5): 3 minutes 42.6 seconds
Dimension Fix: 35.7 seconds
Sandbox Table: 45.6 seconds
Gender Query: 47.5 seconds
Location Query: 48.4 seconds
ICD9 Query: 48.8 seconds
Total: 8 minutes 13.2 seconds
Hive looks to be a good entry-level tool for SQL on Hadoop. The performance isn’t great, but it comes with every distribution of Hadoop and is easier to use than other Hadoop tools. HAWQ, however, will give you significantly better performance, and it can run on either the Pivotal HD or the Hortonworks distribution. Both distributions are based on the ODP, which delivers the standard Hadoop core, so HAWQ will work on either one.
In my next post, I plan on enabling Tez for Hive to see what the performance benefit will be. It should help, but I don’t think it will come close to the performance of HAWQ. We shall see.
Pivotal teams up with Hortonworks for the Open Data Platform. Pivotal is also making its data products (GemFire, Greenplum Database, and HAWQ) all open source.
I’m really excited about the Pivotal data products this year. There are some great changes coming that will be announced February 17, 2015 at 11AM PST / 2 PM EST.
This is an online event you can register for here.
I recently worked with a customer that receives very large and complex XML files from external partners. This customer wanted the XML files parsed and available for SQL access so they can do reporting and analytics.
There are many ways to handle XML files, but in this case, with very large files, I needed a cluster of machines, and Hadoop is pretty good at that. The processing can be done with MapReduce or with a tool like Pig, which simplifies writing MapReduce jobs.
Here is a sample XML file:
<?xml version="1.0"?>
<catalog>
  <large-product><name>foo1</name><price>110</price></large-product>
  <large-product><name>foo2</name><price>120</price></large-product>
  <large-product><name>foo3</name><price>130</price></large-product>
  <large-product><name>foo4</name><price>140</price></large-product>
  <large-product><name>foo5</name><price>150</price></large-product>
  <small-product><name>bar1</name><price>10</price></small-product>
  <small-product><name>bar2</name><price>20</price></small-product>
  <small-product><name>bar3</name><price>30</price></small-product>
  <small-product><name>bar4</name><price>40</price></small-product>
  <small-product><name>bar5</name><price>50</price></small-product>
</catalog>
As you can see, I have two record sets of large products and small products but I just want the small products in a table.
First, put the raw XML data into Hadoop.
hdfs dfs -mkdir /demo4
hdfs dfs -put catalog.xml /demo4
Here is the Pig script.
REGISTER /usr/lib/gphd/pig/piggybank.jar;

A = LOAD '/demo4/catalog.xml' USING org.apache.pig.piggybank.storage.XMLLoader('small-product') AS (doc:chararray);

clean = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(doc, '<small-product>\\s*<name>(.*)</name>\\s*<price>(.*)</price>\\s*</small-product>')) AS (name:chararray, price:int);

STORE clean INTO '/demo4/alt_small_data';
Pig first pulls out only the small-product records. This takes a single line in the script and is very useful. The next step uses regular expressions to parse each tag. This is very painful to get right because Pig uses MapReduce to parse the data, which is powerful but relatively slow to iterate with. Even with a small file, each iteration took at least 30 seconds to execute, and the full file took 22 minutes.
The last step is to create an External Table in HAWQ.
DROP EXTERNAL TABLE IF EXISTS ext_alt_demo4;

CREATE EXTERNAL TABLE ext_alt_demo4
(
  name text,
  price int
)
LOCATION ('pxf://pivhdsne:50070/demo4/alt_small_data/part*?profile=HdfsTextSimple')
FORMAT 'text' (delimiter E'\t');
And selecting the data in HAWQ.
SELECT * FROM ext_alt_demo4;
 name | price
------+-------
 bar1 |    10
 bar2 |    20
 bar3 |    30
 bar4 |    40
 bar5 |    50
(5 rows)

Time: 127.334 ms
This was my first approach to XML parsing, until I got frustrated with the number of XML tags I had to write regular expressions for. The XML I had wasn’t as neat as my example, so I had to re-run the Pig script over and over again for each slight modification to the parsing logic.
This is the same basic process as Solution 1, but instead of parsing each record with regular expressions in Pig, I create a single column and do the parsing with SQL in HAWQ.
Here is my Pig script.
REGISTER /usr/lib/gphd/pig/piggybank.jar;

A = LOAD '/demo4/catalog.xml' USING org.apache.pig.piggybank.storage.XMLLoader('small-product') AS (doc:chararray);

clean = FOREACH A GENERATE REPLACE(REPLACE(doc, '\\u000D', ''), '\\u000A', '');

STORE clean INTO '/demo4/small_data';
So instead of using regular expressions, I strip the carriage return and newline characters from the XML so that each record sits on a single row, and then store the result back in Hadoop.
Here is the External Table in HAWQ.
CREATE EXTERNAL TABLE ext_demo4
(
  xml_data text
)
LOCATION ('pxf://pivhdsne:50070/demo4/small_data/part*?profile=HdfsTextSimple')
FORMAT 'TEXT' (delimiter E'\t');
I then created a simple SQL function to parse the data.
CREATE OR REPLACE FUNCTION fn_extract_xml_value(p_tag text, p_xml text) RETURNS text AS
$$
SELECT SPLIT_PART(SUBSTRING($2 FROM '<' || $1 || '>(.*)</' || $1 || '>'), '<', 1)
$$
LANGUAGE SQL;
And my SQL statement that parses the data.
SELECT (fn_extract_xml_value('name', xml_data))::text AS name,
       (fn_extract_xml_value('price', xml_data))::int AS price
FROM ext_demo4;

 name | price
------+-------
 bar1 |    10
 bar2 |    20
 bar3 |    30
 bar4 |    40
 bar5 |    50
(5 rows)

Time: 94.887 ms
The big benefit of this second approach is the huge performance increase while iterating to get the XML parsing correct. Instead of taking several minutes to validate my code in Pig, I could execute a SQL statement that runs in under a second, spend another second modifying the SQL function, and try again.
Hadoop is powerful and has become commodity software, with many distributions available that are all pretty much the same. What differentiates the distributions is the software unique to each vendor. Some vendors rely on their management tools, while Pivotal HD has HAWQ, the most robust SQL engine for Hadoop. This example shows how you can leverage the built-in functionality of Hadoop plus HAWQ to be more productive than with any other Hadoop distribution.
Fully supported, all-you-can-eat Hadoop is now available with the Big Data Suite. Mix and match HAWQ, Greenplum, and GemFire too. The suite makes it easy for companies to explore Pivotal’s data products and find the right mix.
This video demonstrates a way to load publicly available IP address location information into Hadoop and match it to log activity. This would be useful for matching web log activity to the actual location of clients.
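The matching step boils down to a range join. A minimal SQL sketch, assuming hypothetical tables geo_ip (one row per IP range with its location) and web_log (one row per request), with IP addresses already converted to integers, might look like:

```sql
-- Hypothetical sketch of matching log IPs to location ranges.
-- geo_ip and web_log are illustrative names; the video walks
-- through the actual load and conversion steps.
SELECT l.url,
       g.city
FROM   web_log l
JOIN   geo_ip  g
  ON   l.ip BETWEEN g.start_ip AND g.end_ip;
```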
Outsourcer 3.1 is now available for download on the Installation page. Updated Documentation is also available on the Documentation page.
DDL Refresh Type
New Name: PivotalGuru.com
Greenplum is now part of Pivotal and to follow this name change, I now have the PivotalGuru.com site!
Pivotal is an exciting and rather large startup company with investments from EMC, VMware, and GE. I am proud to be part of this new company and very excited about the new offerings in Big Data and Cloud Computing that Pivotal will bring to market.