Monthly Archives: March 2015

External Table “TRANSFORM” Option

In the Greenplum Administrator Guide, there is a section that covers loading XML data with gpload. It also mentions that you can create the External Table yourself. This demo will show you how to create an External Table that utilizes this feature so that you can execute any script you want on your ETL server. This demo also works for HAWQ.

XML Parsing is done sequentially from the top to the bottom of a file and the TRANSFORM option was built with this in mind. The gpfdist process will execute a script for you and the output is read by the segments in parallel. Instead of parsing XML with a script, this demo will execute a Unix command in a script to show you how you can leverage this feature to execute virtually any command you want on a remote server.

ETL Server
/data1/demo5/demo5.yml

---
VERSION: 1.0.0.1
TRANSFORMATIONS:
  transform_demo5:
     TYPE: input 
     CONTENT: data
     COMMAND: /bin/bash get_df.sh

As you can see, I created a transform named “transform_demo5” that executes the script “get_df.sh”. So let’s look at get_df.sh.
/data1/demo5/get_df.sh

df -k | awk '{print $1"|"$2"|"$3"|"$4"|"$5"|"$6}' | tail -n +2

This simple command executes the Unix df command and converts that to a pipe delimited file. Executing the command outputs this:

[pivhdsne:demo5]$ df -k | awk '{print $1"|"$2"|"$3"|"$4"|"$5"|"$6}' | tail -n +2
/dev/sda3|47472560|12080036|32981056|27%|/
tmpfs|4030752|0|4030752|0%|/dev/shm
/dev/sda1|99150|48764|45266|52%|/boot

Now start a gpfdist process that uses a configuration file. The “-c” option isn’t well documented with gpfdist but it is mentioned with the XML parsing.

gpfdist -p 8999 -c /data1/demo5/demo5.yml 2>&1 > demo5.log &

Now create an External Table. Notice the format of using #transform=transform_demo5. The filename of “foo” is ignored but you can reference the filename as a parameter to your transform scripts.

/data1/demo5/get_df.sql

CREATE EXTERNAL TABLE get_df 
(Filesystem text,
 K_blocks int,
 Used int,
 Available int,
 Used_percentage text,
 Mounted_on text)
LOCATION ('gpfdist://pivhdsne:8999/foo#transform=transform_demo5')
FORMAT 'TEXT' (DELIMITER '|');

Now create the table in your database server (Hawq or Greenplum).

psql -f get_df.sql -h gpdbvm43 
Password: 
Timing is on.
CREATE EXTERNAL TABLE
Time: 238.788 ms

Now let’s see the output.

[pivhdsne:demo5]$ psql -h gpdbvm43
Password: 
Timing is on.
psql (8.2.15)
Type "help" for help.

gpadmin=# select * from get_df;
 filesystem | k_blocks |   used   | available | used_percentage | mounted_on 
------------+----------+----------+-----------+-----------------+------------
 /dev/sda3  | 47472560 | 12104924 |  32956168 | 27%             | /
 tmpfs      |  4030752 |        0 |   4030752 | 0%              | /dev/shm
 /dev/sda1  |    99150 |    48764 |     45266 | 52%             | /boot
(3 rows)

Time: 32.149 ms

This is yet another feature of Greenplum and Hawq that gives you more flexibility in working with data. I’ve seen this feature used to move files, change permissions, get data from running programs, and of course parsing XML. Enjoy!

Outsourcer 5.0

Outsourcer 5.0 is now available. This version addresses most of the requests that I have received since the release of 4.0. I’m very excited to complete this work!

Enhancments
Outsourcer now uses External Tables with gpfdist rather than External Web Tables. This is a big change and allows you to install Outsourcer on any host that is accessible to the Greenplum Segment Hosts / HAWQ Data Nodes. It also reduces the load on the Master server.

Installation does more checking and automation.

Root is no longer needed for installation.

start_all and stop_all scripts are provided to ease starting and stopping Outsourcer.

A new set of screens for creating your own Custom External Table with a pre-defined SQL Statement.

You can install multiple copies of Outsourcer to load data to multiple databases.

Download
Documentation
Source Code