GPLink Version 0.1.25

GPLink has a new version that fixes a parsing error that could occur when other gpfdist processes were started on your ETL server with their parameters in a different order than GPLink uses. The new version handles this situation. As before, if the ports of other gpfdist processes are outside the Upper and Lower port range defined by GPLink, those gpfdist processes are ignored.

Download Version 0.1.25

Getting the filename with gpfdist

Occasionally, I see requests to include the filename with the data read by gpfdist. Here is a way to do it!

First, create a YML file named “transform_config.yml” with the following:

---
VERSION: 1.0.0.1
TRANSFORMATIONS:
  transformation_input:
     TYPE: input 
     CONTENT: data
     COMMAND: /bin/bash transform.sh

Next, create that “transform.sh” script. This is just a simple example that reads every .txt file in the current directory, but you can also pass a parameter to this script; the filename from the external table LOCATION can be passed to it (more on that below).

#!/bin/bash
set -e
# Prefix each line of every .txt file in the current directory with its filename
for i in *.txt; do awk '{print FILENAME"|"$0}' "$i"; done

Create two test files (test_file_1.txt and test_file_2.txt).

cat test_file_1.txt 
1|foo1
2|foo2
3|foo3
4|foo4
5|foo5
6|foo6
7|foo7
8|foo8
9|foo9
10|foo10
cat test_file_2.txt 
11|foo11
12|foo12
13|foo13
14|foo14
15|foo15
16|foo16
17|foo17
18|foo18
19|foo19
20|foo20

Start gpfdist in the background.

gpfdist -p 8999 -c transform_config.yml > mylog 2>&1 < mylog &

Create the External Table, but be sure to change the hostname for your environment. Note that "foo" in the LOCATION is the filename. I'm ignoring it for this example, but this is how you can pass a parameter to the script: you add %filename% to the YML file as the parameter to the script (a sketch of this appears after the query output below).

CREATE EXTERNAL TABLE ext_transform_table 
(filename text, id int, descrption text) 
LOCATION ('gpfdist://gpdbsne:8999/foo#transform=transformation_input') 
FORMAT 'text' (DELIMITER '|');

Now select from the External Table.

select * from ext_transform_table;
    filename     | id | descrption 
-----------------+----+------------
 test_file_1.txt |  1 | foo1
 test_file_1.txt |  2 | foo2
 test_file_1.txt |  3 | foo3
 test_file_1.txt |  4 | foo4
 test_file_1.txt |  5 | foo5
 test_file_1.txt |  6 | foo6
 test_file_1.txt |  7 | foo7
 test_file_1.txt |  8 | foo8
 test_file_1.txt |  9 | foo9
 test_file_1.txt | 10 | foo10
 test_file_2.txt | 11 | foo11
 test_file_2.txt | 12 | foo12
 test_file_2.txt | 13 | foo13
 test_file_2.txt | 14 | foo14
 test_file_2.txt | 15 | foo15
 test_file_2.txt | 16 | foo16
 test_file_2.txt | 17 | foo17
 test_file_2.txt | 18 | foo18
 test_file_2.txt | 19 | foo19
 test_file_2.txt | 20 | foo20
(20 rows)
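
For completeness, here is a sketch of the transform_config.yml modified to pass the filename from the LOCATION to the script, using the %filename% parameter mentioned above (the script would then read the value as $1 instead of looping over *.txt; adjust names for your environment):

---
VERSION: 1.0.0.1
TRANSFORMATIONS:
  transformation_input:
     TYPE: input
     CONTENT: data
     # %filename% is replaced with the filename from the external table LOCATION
     COMMAND: /bin/bash transform.sh %filename%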

HAWQ 2.0 Generally Available and Outsourcer 5.1.4

HAWQ, known commercially as Pivotal HDB, just had a major release that I’m really excited about.

Major Features
– Based on Apache HAWQ and includes support for Quicklz table compression as well as PL/R, PL/Java, and pgCrypto
– Elastic runtime which means more segments (resources) can be allocated automatically based on the complexity of the query
– YARN integration
– Dynamic sizing of the cluster
– Block level storage which enables maximum parallelism
– Single HDFS directory per table which makes it easier to share and manage data
– Fault tolerance enhancements make it easier and quicker to add or remove data nodes
– HDFS catalog caching
– HCatalog integration which greatly simplifies accessing Hive data
– New management interface with “hawq” commands
– Support for Ambari 2.2.2
– Plugin support for Kerberos
– Better logging for runaway query termination

Product Page
Documentation
Download

Outsourcer 5.1.4
I have also updated Outsourcer to take advantage of HAWQ 2.0/Pivotal HDB 2.0. In HAWQ 2.0/Pivotal HDB 2.0, tables should be distributed randomly in order to take advantage of many of the new features. Starting with version 5.1.4, Outsourcer will make all tables distributed randomly when the database is HAWQ 2.0/Pivotal HDB 2.0. For Greenplum and HAWQ 1.3, tables will still be distributed by the source’s primary key if one is found.
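
To illustrate the difference (hypothetical table and column names; the actual DDL Outsourcer generates may differ), the create statements would look something like this:

-- Target is HAWQ 2.0 / Pivotal HDB 2.0: always random distribution
CREATE TABLE public.customer
(customer_id int,
 customer_name text)
DISTRIBUTED RANDOMLY;

-- Target is Greenplum or HAWQ 1.3 and the source has a primary key on customer_id
CREATE TABLE public.customer
(customer_id int,
 customer_name text)
DISTRIBUTED BY (customer_id);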

Documentation
Download 5.1.4
Source Code

Pivotal HDB 2.0 (Apache HAWQ) Table Distribution

Pivotal HDB version 2.0 is very close to being generally available, and there are differences in how table distribution works between this version and 1.3 that are worth mentioning.

Distribution
HDB is a fork of Greenplum Database which is an MPP database. Greenplum distributes or shards the data across multiple “segments” which are located on multiple “segment hosts”. The distribution is typically set by a hash of a column or set of columns.

Example:

CREATE TABLE customer 
(customer_id int,
 customer_name text,
 customer_address text)
DISTRIBUTED BY (customer_id);

In Greenplum, this would create a file per segment on each segment host. If you make the table column oriented, the number of files increases to one file per column, per segment. Add in partitioning, which again uses separate files for each partition, and you end up with possibly thousands of files for a single table. This is great for an MPP database with a robust optimizer that can skip scanning files it doesn’t need in order to execute the query in the fastest way possible.
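
To make the file-count math concrete, here is an illustrative (hypothetical) column-oriented, partitioned table. Each column of each monthly partition gets its own file on every segment, so on a 16-segment cluster this one table would have roughly 3 columns x 12 partitions x 16 segments = 576 data files:

CREATE TABLE sales
(sale_id int,
 sale_date date,
 amount numeric)
WITH (appendonly=true, orientation=column)
DISTRIBUTED BY (sale_id)
PARTITION BY RANGE (sale_date)
(START (date '2016-01-01') INCLUSIVE
 END (date '2017-01-01') EXCLUSIVE
 EVERY (INTERVAL '1 month'));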

HDB 1.3 uses the same design pattern as Greenplum but stores the files in HDFS. Hadoop loves big files but doesn’t work optimally with lots of files, which meant that you typically didn’t use column orientation and you made partitions larger.

HDB 2.0 Distribution
1. There is now a dynamic number of segment processes per host. There is just a single segment directory per data node, and the database dynamically creates the number of buckets as needed.

2. When you create a table with the distribution set (as shown above), the number of buckets is fixed. It is controlled by the GUC default_hash_table_bucket_number, which sets the number of buckets per host (see the SHOW example after this list).

3. When you create a table with random distribution, the number of buckets is dynamic.
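
To check the bucket setting referenced in item 2 on your own system, a standard SHOW works (a minimal sketch; on the single-node VM used later in this post it is set to 6):

SHOW default_hash_table_bucket_number;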

So how does this work? Take our example “customer” table above with the distribution set to (customer_id).

INSERT INTO customer SELECT i, 'company_' || i, i || ' main st' 
FROM generate_series(1,1000) AS i;

Query returned successfully: 1000 rows affected, 784 msec execution time.

Now let’s go look at the files.

--use the OID values to find the location in HDFS
SELECT oid FROM pg_database WHERE datname = 'gpadmin';

16508

SELECT c.oid
FROM pg_namespace n
JOIN pg_class c ON n.oid = c.relnamespace
WHERE n.nspname = 'public'
AND c.relname = 'customer'
AND c.relkind = 'r';

24591

So our data files are in HDFS. This is just a single-node VM I’m working with, so there is one segment. It has default_hash_table_bucket_number set to 6, so HDB will create 6 buckets of data in HDFS.

hdfs dfs -ls /hawq_default/16385/16508/24591
Found 6 items
-rw-------   1 gpadmin hdfs       6776 2016-05-04 16:51 /hawq_default/16385/16508/24591/1
-rw-------   1 gpadmin hdfs       6768 2016-05-04 16:51 /hawq_default/16385/16508/24591/2
-rw-------   1 gpadmin hdfs       6688 2016-05-04 16:51 /hawq_default/16385/16508/24591/3
-rw-------   1 gpadmin hdfs       6728 2016-05-04 16:51 /hawq_default/16385/16508/24591/4
-rw-------   1 gpadmin hdfs       7600 2016-05-04 16:51 /hawq_default/16385/16508/24591/5
-rw-------   1 gpadmin hdfs       7488 2016-05-04 16:51 /hawq_default/16385/16508/24591/6

Now recreate this table with random distribution, insert the data, and look at the files.

DROP TABLE IF EXISTS customer;
CREATE TABLE customer 
(customer_id int,
 customer_name text,
 customer_address text)
DISTRIBUTED RANDOMLY;

INSERT INTO customer SELECT i, 'company_' || i, i || ' main st' 
FROM generate_series(1,1000) AS i;

SELECT c.oid
FROM pg_namespace n
JOIN pg_class c ON n.oid = c.relnamespace
WHERE n.nspname = 'public'
AND c.relname = 'customer'
AND c.relkind = 'r';

24596

[gpadmin@hdp23 ~]$ hdfs dfs -ls /hawq_default/16385/16508/24596
Found 1 items
-rw-------   1 gpadmin hdfs      41968 2016-05-04 17:02 /hawq_default/16385/16508/24596/1

It only created a single file in HDFS with random distribution.
– This is great for HDFS because there are fewer files for the namenode to track.
– It allows for elasticity of the cluster: you can grow or shrink the cluster without having to redistribute the data.
– The optimizer has also been enhanced to dynamically set the number of buckets based on the demand of the query.

As you might be concluding right about now, RANDOM DISTRIBUTION is the recommendation for tables in HDB 2.0. You can still set your distribution to a hash of a column or columns which will use a static number of buckets but random is recommended.

If you create a table now in HDB 2.0 without setting the distribution, the default will be RANDOM.

Proof

DROP TABLE IF EXISTS customer;
CREATE TABLE customer 
(customer_id int,
 customer_name text,
 customer_address text)
 DISTRIBUTED BY (customer_id);

SELECT sub.attname
FROM pg_namespace n
JOIN pg_class c ON n.oid = c.relnamespace
LEFT JOIN 
(SELECT p.attrelid, p.attname
FROM pg_attribute p
JOIN (SELECT localoid, unnest(attrnums) AS attnum FROM gp_distribution_policy) AS g ON g.localoid = p.attrelid AND g.attnum = p.attnum) AS sub
ON c.oid = sub.attrelid 
WHERE n.nspname = 'public'
AND c.relname = 'customer'
AND c.relkind = 'r';

"customer_id"

Now, recreate the table without setting the distribution key.

DROP TABLE IF EXISTS customer;
CREATE TABLE customer 
(customer_id int,
 customer_name text,
 customer_address text);

SELECT sub.attname
FROM pg_namespace n
JOIN pg_class c ON n.oid = c.relnamespace
LEFT JOIN 
(SELECT p.attrelid, p.attname
FROM pg_attribute p
JOIN (SELECT localoid, unnest(attrnums) AS attnum FROM gp_distribution_policy) AS g ON g.localoid = p.attrelid AND g.attnum = p.attnum) AS sub
ON c.oid = sub.attrelid 
WHERE n.nspname = 'public'
AND c.relname = 'customer'
AND c.relkind = 'r';

--no data

Outsourcer 5.1.1

5.1.1 enhances Append jobs to use Big Integer in addition to Integer data types. Additionally, you can now use Timestamp data types.

Be sure to always use an ordered sequence in Oracle and an ordered identity in SQL Server when using an Append job. Timestamp is useful when you are using the system timestamp in Oracle or SQL Server to append new data.
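
For reference, here is a sketch of what an ordered source key might look like (object names are hypothetical; adjust to your schema):

-- Oracle: an ordered sequence used to populate the append column
CREATE SEQUENCE customer_seq START WITH 1 INCREMENT BY 1 ORDER;

-- SQL Server: an ordered identity column on the source table
CREATE TABLE dbo.customer
(customer_id bigint IDENTITY(1,1) PRIMARY KEY,
 customer_name varchar(100));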

Download
Documentation
Source Code

HAWQ Demystified

The genesis of the best SQL engine for Hadoop was not an overnight, “me too” effort. It wasn’t too long ago that the Hadoop vendors all but wrote off SQL, but the demand just hasn’t gone away.

HAWQ is the result of many, many years of work leveraging open source software as well as contributing back to the open source community. I think a short history lesson is in order to fully understand how this product came to be.

Greenplum
Greenplum Database began life in 2003. The founders used open source PostgreSQL and released a commercial product soon after.

Bizgres
Bizgres, an open source version of Greenplum, was released in 2005. Very early on, the founders of Greenplum embraced contributing back to the open source community.


Madlib
Madlib was released in 2010 as an open source project, which later became an Apache Incubator project.

EMC
Greenplum was acquired by EMC in 2010 and almost immediately, EMC invested heavily into Hadoop. The Greenplum division was agile like a small startup company but with the deep pockets of a multi-billion dollar company.

GPHD
Greenplum released a Hadoop distribution in 2011 and integration between Greenplum Database and HDFS got more robust with the introduction of “gphdfs”. Greenplum supported External Tables to read/write data in parallel to HDFS from several different distributions.

HAWQ
HAWQ, a fork of Greenplum Database, was released in 2013. HAWQ was immediately extremely performant and compliant with the newest SQL syntax. HAWQ borrowed from the 10 years experience of developing Greenplum to provide a robust optimizer designed for HDFS.

Pivotal
2013 also saw Pivotal become a company. EMC contributed Greenplum Database, VMware contributed Gemfire and Cloud Foundry, and GE contributed capital as an active partner. Paul Maritz became the CEO, and the dedication to fully embrace open source became an integral part of the corporate culture.

During the last three years, HAWQ has become an Apache Incubator project. The Pivotal product now carries the rather boring name “Pivotal HDB”, while HAWQ is the name of the Apache project.

Pivotal also made Greenplum and Geode (Gemfire is the commercial product name) open source projects. Clearly, Pivotal has embraced open source, with probably more committers than those other “open source” data companies.

So what now? What is happening in 2016? Well, Pivotal is about to release Pivotal HDB (HAWQ) 2.0. I’ve been testing this product for months on various platforms and I keep being amazed by the performance and ease of use.

HAWQ 2.0 embraces Hadoop fully. I believe the two biggest features are elasticity and performance. HAWQ now supports elasticity for growing or shrinking clusters without having to redistribute the data. The performance is also much improved as it better utilizes HDFS and YARN.

Pivotal HDB is certified to run on Hortonworks HDP, with plans to become a first-class citizen of the Open Data Platform (ODPi).

So you may be asking, “is it fast?” and the answer is yes! I haven’t found a SQL engine that is faster and I’ve been doing competitive analysis for months. The other question you may ask is, “can I run my SQL?” and the answer is yes! A major competitor in the SQL on Hadoop landscape requires more tweaking of SQL just to get the SQL to execute and more tuning to get decent performance.

That “other SQL on Hadoop” product can’t do the following things (among many more) that HAWQ can; a couple of these are shown running in HAWQ after the list.

– Can’t handle a comment line at the end of SQL file (I found that rather strange)
– Won’t do partition elimination through a joined table (e.g. date dimension)
– Can’t get number of milliseconds when subtracting from a date. Only gets seconds.
– Has “interval” in SQL dialect but doesn’t have interval as a type.
– No time data type.
– Concatenating strings doesn’t use || or +; you must use the concat() function.
– Doesn’t support intersect or except.
– Doesn’t support subqueries that return more than one row.
– Doesn’t support correlated subqueries.
– Doesn’t support group by rollup.
– Doesn’t support subqueries in having statement.
– Subqueries not supported in select list.
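
As a couple of quick examples, items from this list run unchanged in HAWQ (the orders table here is hypothetical and only for illustration):

-- Set operations such as INTERSECT
SELECT customer_id FROM customer
INTERSECT
SELECT customer_id FROM orders;

-- A correlated subquery
SELECT c.customer_name
FROM customer c
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id);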

HAWQ is the real deal. It is an Apache project, soon to be part of ODPi, faster than everyone else, integrated with Apache Ambari, certified with Hortonworks, and the most mature SQL engine for Hadoop.