Step 6 Testing

A quick way to demonstrate the performance and SQL compatibility of HAWQ is to execute the TPC-DS benchmark. Unlike the other SQL on Hadoop solutions, HAWQ can execute all 99 TPC-DS queries without alteration. Here is how you can easily test this in your VM.

The github repo is here: https://github.com/pivotalguru/TPC-DS and here is tutorial on how to use it.

ssh root@hdb
curl https://raw.githubusercontent.com/pivotalguru/TPC-DS/master/tpcds.sh > tpcds.sh
chmod 755 tpcds.sh
./tpcds.sh

You will see the message, “There are new variables in the tpcds_variables.sh file. Please review to ensure the values are correct and then re-run this script.” The initial execution of the script creates the tpcds_variables file which has defaults but it needs to be configured for the single node VM.

sed -i 's/GEN_DATA_SCALE="3000"/GEN_DATA_SCALE="1"/g' tpcds_variables.sh 
sed -i 's/RANDOM_DISTRIBUTION="false"/RANDOM_DISTRIBUTION="true"/g' tpcds_variables.sh 
sed -i 's/MULTI_USER_COUNT="5"/MULTI_USER_COUNT="3"/g' tpcds_variables.sh

This will create a single gigabyte of data, use random distribution which is recommended for HAWQ, and the multi-user test is changed to 3 because of the limited resources available in the single node.

Now, execute tpcds.sh again and the entire test with reports will execute.

./tpcds.sh

If you want to explore the scripts to see how the data was loaded and look at the queries, switch to the gpadmin account and go to the files.

su - gpadmin
cd /pivotalguru/TPC-DS/05_sql
cat 155.tpcds.55.sql
:EXPLAIN_ANALYZE
-- start query 82 in stream 0 using template query55.tpl and seed 1852232819
select  i_brand_id brand_id, i_brand brand,
 	sum(ss_ext_sales_price) ext_price
 from date_dim, store_sales, item
 where d_date_sk = ss_sold_date_sk
 	and ss_item_sk = i_item_sk
 	and i_manager_id=25
 	and d_moy=11
 	and d_year=2002
 group by i_brand, i_brand_id
 order by ext_price desc, i_brand_id
limit 100 ;

The :EXPLAIN_ANALYZE is a variable that can be used to pass in “EXPLAIN ANALYZE” text so the database will display the query plan used with other meaningful statistics. It is one of the options in the tpcds_variables.sh file to run EXPLAIN ANALYZE but by default, it is off.

You can execute this example query like this:

psql -v EXPLAIN_ANALYZE="" -f 155.tpcds.55.sql