Michael Goddard, who is a colleague of mine, put together this demo of the Self-Healing feature I created for the Pivotal Greenplum on AWS Marketplace products. If any node in the cluster fails, it will automatically be replaced and correctly added back!
I’ve been using Greenplum since version 3.0 back in 2007. That version was actually the first commercially available version too. The versions were released about once every 6 to 12 months and got as high as 3.3.
Once the version got to 4.3, the number just seemed to get stuck. 4.3.0 was a pretty big release which changed “Append Only” tables to be “Append Optimized” which simply meant you could start updating and deleting tables stored in the append-only format. But more enhancements came to 4.3 but the version number never exceeded 4.3.
Major enhancements came to version 4.3.x and the innovation came at a faster pace but you may not have noticed this if you were just looking at the version number. Pivotal’s engineering team embraced the Pivotal Labs methodologies of pair programming and quick iterations. There was a huge transformation happening in this group.
When 5.0 came out, it was a big deal. Greeplum had always been a fork of PostgreSQL 8.2 but with 5.0, Greenplum was rebased to 8.3. A lot of work went into this change which also requires migrating from 4.3.x to 5.0.
Now this next change, I didn’t expect. During the 4.3.x development, the only time the version number would change to either 4.4 or 5.0, it would be when the database required a migration to upgrade. With 5.x, the version numbers are coming fast and don’t require migrations to upgrade. It just a simple binary swap to upgrade.
The result has been the release of 5.0 on 2017-09-14, 5.1 on 2017-10-20, and 5.2 on 2017-11-18. Do you see the pattern? Monthly point releases! All of these releases so far have been simple binary upgrades but have a ton of improvements each time.
Version 6 is in sight. Seriously. I would expect this to be the next rebase of Greenplum to PostgreSQL 8.4 and will require a database migration to upgrade. It is amazing how the Labs culture and open source software development has enhanced the development of Greenplum.
Greenplum is a Massively Parallel Processing database which means the data is distributed across multiple segments on multiple hosts. Each segment process stores a catalog of objects and it is possible to get this out of sync because of hardware failures or killing processes. For example, if you are vacuuming a table and then decide to kill the process on the Master host, the segments may not get the signal properly and rollback the transaction. This can happen when using “kill -9” rather than just “kill” to stop a process.
You may get this error message if a table is corrupt:
gpadmin=# select * from corrupt; ERROR: relation with OID 109802 does not exist (seg0 slice1 gpdbsne:40000 pid=78502)
So you decide to drop the table and start over but that doesn’t work either.
gpadmin=# drop table corrupt; ERROR: table "corrupt" does not exist (seg0 gpdbsne:40000 pid=78502)
The easy fix for this is to use “if exists” in the drop command.
gpadmin=# drop table if exists corrupt; DROP TABLE Time: 10.423 ms
Now the table is gone and you can recreate it. Next time, don’t use kill -9 and don’t kill vacuum commands.
If you are wondering how I managed to corrupt this table, here are the steps.
1. Create the table
gpadmin=# create table corrupt (id int, foo text, bar text) distributed by (id); CREATE TABLE Time: 31.195 ms
2. Insert some data
insert into corrupt select i as id, 'foo_' || i, 'bar_' || i from generate_series(1,100) as i; INSERT 0 100 Time: 46.419 ms gpadmin=# analyze corrupt; ANALYZE Time: 108.193 ms gpadmin=# \q
3. Connect to a segment directly. I’m using a single node for this test.
[gpadmin@gpdbsne ~]$ ps -ef | grep postgres | grep M gpadmin 78213 1 0 09:49 ? 00:00:00 /usr/local/greenplum-db-184.108.40.206/bin/postgres -D /data/primary/gpseg1 -p 40001 -b 3 -z 2 --silent-mode=true -i -M mirrorless -C 1 gpadmin 78214 1 0 09:49 ? 00:00:00 /usr/local/greenplum-db-220.127.116.11/bin/postgres -D /data/primary/gpseg0 -p 40000 -b 2 -z 2 --silent-mode=true -i -M mirrorless -C 0 gpadmin 78239 1 0 09:49 ? 00:00:00 /usr/local/greenplum-db-18.104.22.168/bin/postgres -D /data/master/gpseg-1 -p 5432 -b 1 -z 2 --silent-mode=true -i -M master -C -1 -x 0 -E
I see that the segments are using ports 40,000 and 40,001 while the master is using 5432. I’ll connect to seg0 on port 40,000 and then drop the table. This only drops the object in this one segment. I’ll also do a quick query to make sure I only see half of the data because I have 2 segments.
PGOPTIONS='-c gp_session_role=utility' psql -p 40000 psql (8.2.15) Type "help" for help. gpadmin=# select count(*) from corrupt; count ------- 50 (1 row) Time: 3.485 ms
4. Corrupt the database by dropping the table only on one segment.
drop table corrupt; DROP TABLE Time: 9.410 ms gpadmin=# \q
5. Confirm that the table is corrupt.
[gpadmin@gpdbsne ~]$ psql Timing is on. psql (8.2.15) Type "help" for help. gpadmin=# select * from corrupt; ERROR: relation with OID 109802 does not exist (seg0 slice1 gpdbsne:40000 pid=78502)
And again, the fix:
gpadmin=# drop table if exists corrupt; DROP TABLE Time: 10.423 ms
Pivotal HDB 2.0.1 (based on Apache HAWQ)
The newest release of Pivotal HDB is fantastic! It adds new features and resolves some issues as well. Here are the release notes.
One resolved issue affected Outsourcer which is a JDBC problem and is documented here: HAWQ-738. Pivotal HDB 2.0.0 was released in May and I found that Outsourcer suffered from stability issues under load and was generally slower navigating in the UI than HDB 1.3 or Greenplum Database. The issue was quickly resolved in that same month but the fix wasn’t publicly available until October of 2016 with this new release.
Quicklz is a compression library that uses GPL licensing. It is bundled with Greenplum Database and Pivotal HDB 1.3 and 2.0.0. Starting with Pivotal HDB 2.0.1, Quicklz has been removed. Why? Because of that GPL license and HAWQ is an Apache project.
In HDB 1.3 and 2.0.0, the guidance was to use Quicklz compression for row oriented tables but starting with 2.0.1, you should use Snappy compression. It is an easy change too:
CREATE TABLE mytable (id INT, fname TEXT) WITH (APPENDONLY=TRUE, COMPRESSTYPE=SNAPPY) DISTRIBUTED RANDOMLY;
Note: When upgrading to 2.0.1, be sure to first change your Quicklz compressed tables to either Zlib, Parquet with Snappy, or no compression at all. Then upgrade and then you can change back to row orientation and use Snappy compression. More details are in this section of the release notes.
Future releases of Pivotal HDB should come quicker because the licensing hurdle is now complete. This means developers can focus on enhancements and fixes rather than licensing.
This new release officially supports Pivotal HDB 2.0.1. It makes the compression for row oriented tables to now use Snappy instead of Quicklz. If you are using Pivotal HDB 2.0.0 or Greenplum Database, Outsourcer will still use Quicklz compression.
Please consider upgrading to Pivotal HDB 2.0.0 to 2.0.1 especially if you are using Outsourcer. When I test Outsourcer with Pivotal HDB 2.0, I use build 22425 rather than build 22126 which is what you can download from Pivotal’s site. 22126 has the JDBC bug while 22425 and new builds do not. And when you upgrade to 2.0.1, also upgrade to Outsourcer 5.2.0.
Hive is probably the most popular SQL engine for Hadoop but not because it is fast nor does it support SQL used by most organizations. It is just the oldest. Or is it?
Pivotal HDB, powered by Apache HAWQ (incubating), is much faster and has much better SQL language support. Pivotal HDB gets its roots from Greenplum database which has been around longer so the SQL language support is much more mature.
In all of the performance tests I’ve done comparing Pivotal HDB with Apache Hive, the results are dramatic. Pivotal HDB loads data significantly faster and executes queries significantly faster too. And yes, I’ve tested it with Tez and even with LLAP. Hive doesn’t come close to the performance of Pivotal HDB. More will be published on this subject soon.
SQL Language Support
The real reason for this post is the lack of SQL language support that Hive has. This makes it more difficult to write SQL in this environment and it makes it more difficult to migrate from a legacy RDBMS to Hadoop. Here are some examples.
Example 1: Subqueries in the WHERE Clause
This is query 6 from the TPC-DS benchmark:
select a.ca_state state, count(*) cnt from customer_address a ,customer c ,store_sales s ,date_dim d ,item i where a.ca_address_sk = c.c_current_addr_sk and c.c_customer_sk = s.ss_customer_sk and s.ss_sold_date_sk = d.d_date_sk and s.ss_item_sk = i.i_item_sk and d.d_month_seq = (select distinct (d_month_seq) from date_dim where d_year = 1998 and d_moy = 2 ) and i.i_current_price > 1.2 * (select avg(j.i_current_price) from item j where j.i_category = i.i_category) group by a.ca_state having count(*) >= 10 order by cnt limit 100;
This query won’t work in Hive because of the two subqueries in the WHERE clause. You will have to rewrite this query like this:
select a.ca_state state, count(*) cnt from customer_address a ,customer c ,store_sales s ,date_dim d ,item i , (select distinct (d_month_seq) as d_month_seq from date_dim where d_year = 1998 and d_moy = 2 ) as sq1 , (select j.i_category, avg(j.i_current_price) as avg_i_current_price from item j group by j.i_category) as sq2 where a.ca_address_sk = c.c_current_addr_sk and c.c_customer_sk = s.ss_customer_sk and s.ss_sold_date_sk = d.d_date_sk and s.ss_item_sk = i.i_item_sk and d.d_month_seq = sq1.d_month_seq and sq2.i_category = i.i_category and i.i_current_price > 1.2 * sq2.avg_i_current_price group by a.ca_state having count(*) >= 10 order by cnt limit 100;
It took me a few minutes to rewrite this and I’m still not 100% sure it is right. In a conversion effort, I would have to execute it in the legacy RDBMS and see if the results are the same as in Hive. If the legacy RDBMS is changing, then I would have a very difficult time with just a single query.
If I was doing this for benchmarking purposes, I would have to execute TPC-DS in another database that does support this query with the same data to validate the results. Clearly, the lack of SQL language support is a problem.
Example 2: Subqueries in the SELECT Clause
This is query 9 from the TPC-DS benchmark:
select case when (select count(*) from store_sales where ss_quantity between 1 and 20) > 31003 then (select avg(ss_ext_list_price) from store_sales where ss_quantity between 1 and 20) else (select avg(ss_net_paid_inc_tax) from store_sales where ss_quantity between 1 and 20) end bucket1 , case when (select count(*) from store_sales where ss_quantity between 21 and 40) > 24212 then (select avg(ss_ext_list_price) from store_sales where ss_quantity between 21 and 40) else (select avg(ss_net_paid_inc_tax) from store_sales where ss_quantity between 21 and 40) end bucket2, case when (select count(*) from store_sales where ss_quantity between 41 and 60) > 28398 then (select avg(ss_ext_list_price) from store_sales where ss_quantity between 41 and 60) else (select avg(ss_net_paid_inc_tax) from store_sales where ss_quantity between 41 and 60) end bucket3, case when (select count(*) from store_sales where ss_quantity between 61 and 80) > 21646 then (select avg(ss_ext_list_price) from store_sales where ss_quantity between 61 and 80) else (select avg(ss_net_paid_inc_tax) from store_sales where ss_quantity between 61 and 80) end bucket4, case when (select count(*) from store_sales where ss_quantity between 81 and 100) > 4078 then (select avg(ss_ext_list_price) from store_sales where ss_quantity between 81 and 100) else (select avg(ss_net_paid_inc_tax) from store_sales where ss_quantity between 81 and 100) end bucket5 from reason where r_reason_sk = 1
You get this error:
FAILED: ParseException line 2:18 cannot recognize input near '(' 'select' 'count' in expression specification (state=42000,code=40000)
The fix is to move all of those subqueries from the SELECT to the FROM section of the query. Or you can write an elaborate CASE statement. Again, this takes time to rewrite and even more time to validate.
Example 3: Correlated Subqueries
This is query 10 from the TPC-DS benchmark:
select cd_gender, cd_marital_status, cd_education_status, count(*) cnt1, cd_purchase_estimate, count(*) cnt2, cd_credit_rating, count(*) cnt3, cd_dep_count, count(*) cnt4, cd_dep_employed_count, count(*) cnt5, cd_dep_college_count, count(*) cnt6 from customer c,customer_address ca,customer_demographics where c.c_current_addr_sk = ca.ca_address_sk and ca_county in ('Clinton County','Platte County','Franklin County','Louisa County','Harmon County') and cd_demo_sk = c.c_current_cdemo_sk and exists (select * from store_sales,date_dim where c.c_customer_sk = ss_customer_sk and ss_sold_date_sk = d_date_sk and d_year = 2002 and d_moy between 3 and 3+3) and (exists (select * from web_sales,date_dim where c.c_customer_sk = ws_bill_customer_sk and ws_sold_date_sk = d_date_sk and d_year = 2002 and d_moy between 3 ANd 3+3) or exists (select * from catalog_sales,date_dim where c.c_customer_sk = cs_ship_customer_sk and cs_sold_date_sk = d_date_sk and d_year = 2002 and d_moy between 3 and 3+3)) group by cd_gender, cd_marital_status, cd_education_status, cd_purchase_estimate, cd_credit_rating, cd_dep_count, cd_dep_employed_count, cd_dep_college_count order by cd_gender, cd_marital_status, cd_education_status, cd_purchase_estimate, cd_credit_rating, cd_dep_count, cd_dep_employed_count, cd_dep_college_count limit 100;
This query correlates data between customer and the tables in each subquery. Rewriting this can be tricky because you need to make sure you don’t duplicate data when joining to the subqueries.
Example 4: INTERSECT
INTERSECT is a feature used in SQL that is similar to UNION or EXCEPT. It gets a distinct list of values from two queries that are contained in both queries. More information on this feature: https://www.postgresql.org/docs/8.2/static/sql-select.html#SQL-INTERSECT
This is query 8 from the TPC-DS benchmark:
select s_store_name ,sum(ss_net_profit) from store_sales ,date_dim ,store, (select ca_zip from ( (SELECT substr(ca_zip,1,5) ca_zip FROM customer_address WHERE substr(ca_zip,1,5) IN ( '89436','30868','65085','22977','83927','77557', '58429','40697','80614','10502','32779', '91137','61265','98294','17921','18427', '21203','59362','87291','84093','21505', '17184','10866','67898','25797','28055', '18377','80332','74535','21757','29742', '90885','29898','17819','40811','25990', '47513','89531','91068','10391','18846', '99223','82637','41368','83658','86199', '81625','26696','89338','88425','32200', '81427','19053','77471','36610','99823', '43276','41249','48584','83550','82276', '18842','78890','14090','38123','40936', '34425','19850','43286','80072','79188', '54191','11395','50497','84861','90733', '21068','57666','37119','25004','57835', '70067','62878','95806','19303','18840', '19124','29785','16737','16022','49613', '89977','68310','60069','98360','48649', '39050','41793','25002','27413','39736', '47208','16515','94808','57648','15009', '80015','42961','63982','21744','71853', '81087','67468','34175','64008','20261', '11201','51799','48043','45645','61163', '48375','36447','57042','21218','41100', '89951','22745','35851','83326','61125', '78298','80752','49858','52940','96976', '63792','11376','53582','18717','90226', '50530','94203','99447','27670','96577', '57856','56372','16165','23427','54561', '28806','44439','22926','30123','61451', '92397','56979','92309','70873','13355', '21801','46346','37562','56458','28286', '47306','99555','69399','26234','47546', '49661','88601','35943','39936','25632', '24611','44166','56648','30379','59785', '11110','14329','93815','52226','71381', '13842','25612','63294','14664','21077', '82626','18799','60915','81020','56447', '76619','11433','13414','42548','92713', '70467','30884','47484','16072','38936', '13036','88376','45539','35901','19506', '65690','73957','71850','49231','14276', '20005','18384','76615','11635','38177', '55607','41369','95447','58581','58149', '91946','33790','76232','75692','95464', '22246','51061','56692','53121','77209', '15482','10688','14868','45907','73520', '72666','25734','17959','24677','66446', '94627','53535','15560','41967','69297', '11929','59403','33283','52232','57350', '43933','40921','36635','10827','71286', '19736','80619','25251','95042','15526', '36496','55854','49124','81980','35375', '49157','63512','28944','14946','36503', '54010','18767','23969','43905','66979', '33113','21286','58471','59080','13395', '79144','70373','67031','38360','26705', '50906','52406','26066','73146','15884', '31897','30045','61068','45550','92454', '13376','14354','19770','22928','97790', '50723','46081','30202','14410','20223', '88500','67298','13261','14172','81410', '93578','83583','46047','94167','82564', '21156','15799','86709','37931','74703', '83103','23054','70470','72008','49247', '91911','69998','20961','70070','63197', '54853','88191','91830','49521','19454', '81450','89091','62378','25683','61869', '51744','36580','85778','36871','48121', '28810','83712','45486','67393','26935', '42393','20132','55349','86057','21309', '80218','10094','11357','48819','39734', '40758','30432','21204','29467','30214', '61024','55307','74621','11622','68908', '33032','52868','99194','99900','84936', '69036','99149','45013','32895','59004', '32322','14933','32936','33562','72550', '27385','58049','58200','16808','21360', '32961','18586','79307','15492')) intersect (select ca_zip from (SELECT substr(ca_zip,1,5) ca_zip,count(*) cnt FROM customer_address, customer WHERE ca_address_sk = c_current_addr_sk and c_preferred_cust_flag='Y' group by ca_zip having count(*) > 10)A1))A2) V1 where ss_store_sk = s_store_sk and ss_sold_date_sk = d_date_sk and d_qoy = 1 and d_year = 2002 and (substr(s_zip,1,2) = substr(V1.ca_zip,1,2)) group by s_store_name order by s_store_name limit 100;
The fix for this is to change the INTERSECT to UNION and make this another subquery that also does a GROUP BY so you have a distinct list of zip codes.
Of the 99 TPC-DS queries, 19 contains SQL that is not supported by Hive. Pivotal HDB can execute all 99 TPC-DS queries without modification. So not only it is much faster, it is easier to migrate from a legacy RDBMS to Hadoop and easier to start using because the SQL language support is so robust.
The genesis of the best SQL engine in Hadoop is not an overnight, “me too” product. It seems that it wasn’t too long ago the Hadoop vendors all but wrote off SQL but the demand just hasn’t gone away.
HAWQ is the result of many, many years of work leveraging open source software as well as contributing back to the open source community. I think a short history lesson is in order to fully understand how this product came to be.
Greenplum Database began life in 2003. The founders used open source PostgreSQL and released a commercial product soon after.
Bizgres, an open source version of Greenplum, was released in 2005. Very early on, the founders of Greenplum embraced contributing back to the open source community.
Madlib was released in 2010 as an open source project which later became an Apache Incubator project.
Greenplum was acquired by EMC in 2010 and almost immediately, EMC invested heavily into Hadoop. The Greenplum division was agile like a small startup company but with the deep pockets of a multi-billion dollar company.
Greenplum released a Hadoop distribution in 2011 and integration between Greenplum Database and HDFS got more robust with the introduction of “gphdfs”. Greenplum supported External Tables to read/write data in parallel to HDFS from several different distributions.
HAWQ, a fork of Greenplum Database, was released in 2013. HAWQ was immediately extremely performant and compliant with the newest SQL syntax. HAWQ borrowed from the 10 years experience of developing Greenplum to provide a robust optimizer designed for HDFS.
2013 also saw Pivotal become a company. EMC contributed Greenplum Database, VMWare contributed Gemfire and Cloud Foundry, and GE contributed capital as an active partner. Paul Maritiz became the CEO and the dedication to fully embrace open source became an integral part of the corporate culture.
During the last three years, HAWQ has become an Apache Incubator Project. The Pivotal product is now a rather boring name of “Pivotal HDB” while HAWQ is the name of the Apache project.
Pivotal also made Greenplum and Geode (Gemfire is the commercial product name) open source projects too. Clearly, Pivotal has embraced open source with probably more committers than those other “open source” data companies.
So what now? What is happening in 2016? Well, Pivotal is about to release Pivotal HDB (HAWQ) 2.0. I’ve been testing this product for months on various platforms and I keep getting amazed by the performance and ease of use.
HAWQ 2.0 embraces Hadoop fully. I believe the two biggest features are elasticity and performance. HAWQ now supports elasticity for growing or shrinking clusters without having to redistribute the data. The performance is also much improved as it better utilizes HDFS and YARN.
So you may be asking, “is it fast?” and the answer is yes! I haven’t found a SQL engine that is faster and I’ve been doing competitive analysis for months. The other question you may ask is, “can I run my SQL?” and the answer is yes! A major competitor in the SQL on Hadoop landscape requires more tweaking of SQL just to get the SQL to execute and more tuning to get decent performance.
That “other SQL on Hadoop” product can’t do the following things, as well as many more, that HAWQ can.
– Can’t handle a comment line at the end of SQL file (I found that rather strange)
– Won’t do partition elimination through a joined table (e.g. date dimension)
– Can’t get number of milliseconds when subtracting from a date. Only gets seconds.
– Has “interval” in SQL dialect but doesn’t have interval as a type.
– No time data type.
– Concatenating strings doesn’t use || or +. You must use concat() function.
– Doesn’t support intersect or except.
– Doesn’t support subqueries that return more than one row.
– Doesn’t support correlated subqueries.
– Doesn’t support group by rollup.
– Doesn’t support subqueries in having statement.
– Subqueries not supported in select list.
HAWQ is the real deal. It is an Apache project, going to be part of ODPi, faster than everyone else, integrated with Apache Ambari, certified with Hortonworks, and the most mature SQL engine for Hadoop.
So I was alerted to this blog post by a colleague and I was floored on what Cloudera is doing.
Basically, they are creating a new feature for objects that can store multiple levels in a single table. In a traditional relational structure, you may have an Orders table with another table for Order_Details. You would simply JOIN the two tables together when needed.
CREATE SCHEMA example; CREATE TABLE example.Orders (order_id int NOT NULL, order_date date NOT NULL, customer_id int NOT NULL); CREATE TABLE example.Order_Details (order_detail_id int NOT NULL, order_id int NOT NULL, product_id int NOT NULL, quantity int NOT NULL, price numeric NOT NULL);
And when you want to query both tables:
SELECT o.order_date, sum(od.price) FROM example.orders o JOIN example.order_details od ON o.order_id = od.order_id GROUP BY o.order_date;
What Cloudera is doing is an Object approach which combines both tables into a single table using a feature called a STRUCT. This is similar to a composite type in Greenplum, PostgreSQL, and HAWQ.
--Impala CREATE TABLE Orders (order_id int NOT NULL, order_date date NOT NULL, customer_id int NOT NULL, Order_Details ARRAY<STRUCT< order_detail_id: int, product_id: int, quantity: int, price: numeric>> );
I get this approach when working with a OLTP system and developers want to match their Classes to the database structure but I really don’t see the benefit of this in a big data platform. It just makes it difficult for Database Users to understand and use the data.
Every database I’ve ever worked with uses Schema.Table to organize table structures but not Impala! Cloudera has decided to break away from the decades old standard and use the dot notation for their nested fields stored within a STRUCT. If you adopt this silliness, you are adopting a standard that no one else uses. Why would Cloudera want that? Vendor lock-in maybe????
--Impala select * from orders.order_details;
Orders isn’t the schema in the above example. This is a table. Cloudera just broke SQL!
Cartesian Product? Nope, Just Vendor Specific Syntax
Notice how this query appears to reference two different tables without specifying a join.
SELECT o.order_date, sum(od.price) FROM orders o, o.order_details od GROUP BY o.order_date;
They even allow you to do an OUTER JOIN to the STRUCT ARRAY without defining the columns to join on.
SELECT o.order_date, sum(od.price) FROM orders o LEFT OUTER JOIN o.order_details od GROUP BY o.order_date;
Reasons Why Cloudera is Doing This
Here are my guesses as to why they are doing this.
1. They hate SQL or just don’t understand it.
2. Vendor Lock-In. They need ways to make it difficult for customers to switch to a competitor’s product.
3. Impala is bad at joins. Organizing tables like this make it easier for Impala’s query optimizer to create a robust query plan.
Please Cloudera, just stop!
I am honored to have a blog post on the Pivotal Blog!
Starting with Greenplum Database 22.214.171.124, there is a new Query Optimizer that is now available that is faster than the legacy optimizer.
Greenplum Database is a fork of PostgreSQL 8.2. This fork was done to optimize SQL queries for a distributed cluster and it has done very well over the years. Often times Greenplum faster that competing MPP databases because of this Query Optimizer. But, because it has many years of development on top of PostgreSQL, it has become difficult to manage and enhance.
Orca was the code name of the new Query Optimizer that is a refactor of the legacy Query Optimizer that also includes enhancements for performance and maintainability. This post is about how to enable the new Query Optimizer.
Note: You must be using Greenplum Database 126.96.36.199 or newer.
1. Log into the Master server with ssh as gpadmin.
2. Make sure the database is running and healthy.
3. Execute gpconfig:
gpconfig -c optimizer_analyze_root_partition -v on --masteronly
4. Enable the new optimizer for the entire system.
gpconfig -c optimizer -v on --masteronly
5. Apply the changes with gpstop.
6. Disable the new Query Optimizer for the gpperfmon database.
ALTER DATABASE gpperfmon SET OPTIMIZER = OFF;
7. Verify the new Query Optimizer is on.
[gpadmin@mdw ~]$ psql psql (8.2.15) Type "help" for help. gpdb=# explain select version(); QUERY PLAN ------------------------------------------------ Result (cost=0.00..0.00 rows=1 width=8) -> Result (cost=0.00..0.00 rows=1 width=1) Settings: optimizer=on (3 rows)
8. Using psql, change to the gpperfmon database and make sure the new Query Optimizer is off.
gpdb=# \c gpperfmon You are now connected to database "gpperfmon" as user "gpadmin". gpperfmon=# explain select version(); QUERY PLAN ------------------------------------------ Result (cost=0.00..0.01 rows=1 width=0) Settings: optimizer=off (2 rows) gpperfmon=#
I recently built a Virtual Machine running CentOS 6.4, Hortonworks installation of Ambari 1.7, Hortonworks 188.8.131.52, and Pivotal HAWQ 184.108.40.206. If you aren’t already familiar with with the Open Data Platform, it is the shared industry effort to standardize the core components of Hadoop (HDFS, YARN, MapReduce, and Ambari) that both Hortonworks and Pivotal are a part of. This means I have HAWQ running on the Hortonworks deployment and this is a supported configuration.
After I got this built, I decided to put a set of demo scripts I use for HAWQ in the VM. It contains a simple Star Schema of Center for Medicare and Medicaid Services (CMS) data. It consists of 5 Dimensions and a single claims Fact with 1 million claims.
The demo does the following:
1. Loads the claims Fact with 1 million claims
2. Create and loads the 5 Dimensions
3. Creates and loads a single “Sandbox” table that joins all Dimensions to build a 1NF table
4. Execute 3 basic ad-hoc queries
I then converted my scripts to do the same with Hive but I ran into one problem. CMS provides one table with a range of values rather than providing each distinct value. For example:
001|139|Infectious and parasitic diseases 140|239|Neoplasms 240|279|Endocrine, nutritional and metabolic diseases, and immunity disorders ...
With HAWQ, I used generate_series(int, int) in a simple SQL query to generate every value in the range. Unfortunately, Hive doesn’t support this function so I had two options; write my own UDF or borrow the transformed data from HAWQ. I chose the latter because I could get this done faster.
And now, the results!
Load CMS Table: 3.1 seconds
Dimensions: 1.6 seconds
Sandbox Table: 3.1 seconds
Gender Query: 0.9 second
Location Query: 0.72 second
ICD9 Query: 0.95 second
Total: 10.37 seconds
Load CMS Table: 44.5 seconds
Dimensions (4 of the 5): 3 minutes 42.6 seconds
Dimension Fix: 35.7 seconds
Sandbox Table: 45.6 seconds
Gender Query: 47.5 seconds
Location Query: 48.4 seconds
ICD9 Query: 48.8 seconds
Total: 8 minutes 13.2 seconds
Hive looks to be a good entry level tool for SQL on Hadoop. The performance isn’t great but it comes with every distribution of Hadoop and is easier to use than other Hadoop tools. But HAWQ will give you significantly better performance and this can be on either the Pivotal HD or the Hortonworks distributions. Both distributions are based on the ODP which delivers the standard Hadoop core so that HAWQ will work on either distribution.
In my next post, I plan on enabling TEZ for Hive to see what the performance benefit will be. It should help but I don’t think it will come close to the performance of HAWQ. We shall see.
Pivotal teams up with Hortonworks for the Open Data Platform. Pivotal is also making the data products (Gemfire, Greenplum Database, and HAWQ) all open source.
I’m really excited about the Pivotal data products this year. There are some great changes coming that will be announced February 17, 2015 at 11AM PST / 2 PM EST.
This is an online event you can register for here.
I recently worked with a customer that receives very large and complex XML files from external partners. This customer wanted the XML files parsed and available for SQL access so they can do reporting and analytics.
There are many ways to handle XML files but in this case in which I had very large files, I needed a cluster of machines and Hadoop is pretty good at that. The processing can be done with Map Reduce or a tool like Pig which simplifies Map Reduce.
- Load raw file to Hadoop
- Transform XML to tab delimited file with Pig
- Create External Table in HAWQ to read file data in Hadoop
Sample XML file.
<?xml version="1.0"?> <catalog> <large-product> <name>foo1</name> <price>110</price> </large-product> <large-product> <name>foo2</name> <price>120</price> </large-product> <large-product> <name>foo3</name> <price>130</price> </large-product> <large-product> <name>foo4</name> <price>140</price> </large-product> <large-product> <name>foo5</name> <price>150</price> </large-product> <small-product> <name>bar1</name> <price>10</price> </small-product> <small-product> <name>bar2</name> <price>20</price> </small-product> <small-product> <name>bar3</name> <price>30</price> </small-product> <small-product> <name>bar4</name> <price>40</price> </small-product> <small-product> <name>bar5</name> <price>50</price> </small-product> </catalog>
As you can see, I have two record sets of large products and small products but I just want the small products in a table.
Fist, put the raw XML data into Hadoop.
hdfs dfs -mkdir /demo4 hdfs dfs -put catalog.xml /demo4
Here is the Pig script.
REGISTER /usr/lib/gphd/pig/piggybank.jar; A = LOAD '/demo4/catalog.xml' USING org.apache.pig.piggybank.storage.XMLLoader('small-product') AS (doc:chararray); clean = foreach A GENERATE FLATTEN(REGEX_EXTRACT_ALL(doc,'<small-product>\\s*<name>(.*)</name>\\s*<price>(.*)</price>\\s*</small-product>')) AS (name:chararray,price:int); store clean into '/demo4/alt_small_data';
What Pig is doing for me is to first only get the small-product records. This only requires a single line in the script and is very useful. The next step is to use regular expressions to parse each tag. This is very painful to get right because Pig use Map Reduce to parse the data. This is powerful but relatively slow to iterate until you get it right. Even with a small file, each iteration took at least 30 seconds to execute and the full file took 22 minutes.
The last step is to create an External Table in HAWQ.
DROP EXTERNAL TABLE IF EXISTS ext_alt_demo4; CREATE EXTERNAL TABLE ext_alt_demo4 ( name text, price int ) LOCATION ( 'pxf://pivhdsne:50070/demo4/alt_small_data/part*?profile=HdfsTextSimple' ) FORMAT 'text' (delimiter E'\t');
And selecting the data in HAWQ.
SELECT * FROM ext_alt_demo4; name | price ------+------- bar1 | 10 bar2 | 20 bar3 | 30 bar4 | 40 bar5 | 50 (5 rows) Time: 127.334 ms
This was my first approach for XML parsing until I got frustrated with the many XML tags to create regular expressions for. The XML I had wasn’t as neat as my example so I had to re-run the Pig script over and over again for each slight modification to the parsing logic.
This the same basic process as Solution 1 but instead of parsing each record with regular expressions in Pig, I will create a single column and do the parsing with SQL in HAWQ.
Here is my Pig script.
REGISTER /usr/lib/gphd/pig/piggybank.jar; A = LOAD '/demo4/catalog.xml' USING org.apache.pig.piggybank.storage.XMLLoader('small-product') AS (doc:chararray); clean = foreach A generate REPLACE(REPLACE(doc, '\\u000D', ''), '\\u000A', ''); store clean into '/demo4/small_data';
So instead of using regular expressions, I’m replacing carriage return and newline characters from the XML so that each record is in one row. Then I store that back in Hadoop.
Here is the External Table in HAWQ.
CREATE EXTERNAL TABLE ext_demo4 ( xml_data text ) LOCATION ( 'pxf://pivhdsne:50070/demo4/small_data/part*?profile=HdfsTextSimple' ) FORMAT 'TEXT' (delimiter E'\t');
I then created a simple SQL function to parse the data.
CREATE OR REPLACE FUNCTION fn_extract_xml_value(p_tag text, p_xml text) RETURNS TEXT AS $$ SELECT SPLIT_PART(SUBSTRING($2 FROM '<' || $1 || '>(.*)</' || $1 || '>'), '<', 1) $$ LANGUAGE SQL;
And my SQL statement that parses the data.
SELECT (fn_extract_xml_value('name', xml_data))::text AS name, (fn_extract_xml_value('price', xml_data))::int AS price FROM ext_demo4; name | price ------+------- bar1 | 10 bar2 | 20 bar3 | 30 bar4 | 40 bar5 | 50 (5 rows) Time: 94.887 ms
The benefit for me in this second approach is the huge performance increase in the iterative approach of getting the XML parsing correct. Instead of taking several minutes to validate my code in Pig, I could execute a SQL statement that takes less than 1 second to run. It took another quick second to modify the SQL function and then I would try again.
Hadoop is powerful and has become commodity software with many distributions available that are all pretty much the same. The difference in distributions is the software that is unique to each vendor. Some vendors rely on their management tools while Pivotal HD has HAWQ which is the most robust SQL engine for Hadoop. This example shows how you can leverage the built-in functionality of Hadoop plus HAWQ to be more productive compared to using any other Hadoop distribution.
Fully supported and all you can eat Hadoop is now available with the Big Data Suite. Mix and match HAWQ, Greenplum, and Gemfire too. This suite makes it so easy for companies to explore Pivotal’s data products to find the right mix.
This video demonstrates a way to load publicly available IP Address location information information into Hadoop and match it to log activity. This would be useful for matching web log activity to the actual location of clients.
Outsourcer 3.1 is now available for download on the Installation page. Updated Documentation is also available on the Documentation page.
- FetchSize is now configurable. To minimize round trips to Oracle, make this setting rather large (greater than 10,000) which increases exporting speed but at the expense of needing more RAM. Adjust this setting up or down based on your environment. Default is 40,000.
- Fix to better handle non-default schemas
DDL Refresh Type
- Several customers have requested for Outsourcer to just create the tables in Greenplum based on the DDL in the source without loading the tables. This new Refresh Type does just that.
New Name: PivotalGuru.com
Greenplum is now part of Pivotal and to follow this name change, I now have the PivotalGuru.com site!
Pivotal is an exciting and rather large startup company with investments from EMC, VMWare, and GE. I am proud to be part of this new company and very excited about the new offerings in Big Data and Cloud Computing that Pivotal will bring to market.