Loading data with External Tables and gpfdist

Straight from the Greenplum Admin Guide:

gpfdist is Greenplum’s parallel file distribution program. It is used by readable external tables and gpload to serve external table files to all Greenplum Database segments in parallel. It is used by writable external tables to accept output streams from Greenplum Database segments in parallel and write them out to a file.

In this post, I’m going to demonstrate how to use a readable external table to load data that uses gpfdist. This makes it easy for Greenplum users to load massive amounts of data by just writing SQL!

First, start gpfdist as a background process.

gpfdist -d /Users/gpadmin -p 8080 &

Create my Table in Greenplum.

CREATE TABLE foo
(bar_id int not null,
 bar text,
 bar_description text)
DISTRIBUTED BY (bar_id);

Now create my External Table.

CREATE EXTERNAL TABLE ext_foo
(LIKE foo) LOCATION ('gpfdist://localhost:8080/demo/foo.txt')
FORMAT 'TEXT' (DELIMITER AS '|' NULL AS 'null');
  • I could have spelled out all of the columns but I used a shortcut by using “LIKE foo”
  • The location indicates it is using gpfdist and the host is local with port 8080
  • My file is located /Users/gpadmin/demo/foo.txt but gpfdist is serving /Users/gpadmin so I need to only specify the demo subdirectory and then the file name
  • You can pick TEXT of CSV format but TEXT will execute faster
  • I used pipes as my delimiter and spelled out null for my null values
Insert the data.
INSERT INTO foo SELECT * FROM ext_foo;

You are done!

Other tips:

  • When creating the External Table, you can also specify an error_table (created dynamically if not found) with a reject limit.
LOG ERRORS INTO error_table SEGMENT REJECT LIMIT count ROWS
  • If you have a really large file to load, you can split the file into multiple files and then specify multiple locations in the External Table.  You can even specify different hosts for each file with multiple gpfdists running at once.

And remember, that inserting data this way pushes the data directly to the segments.  It bypasses the MASTER server so it scales much better than other database’s loading utilities.  Greenplum can load 10TB/Hour for a full rack DCA when using gpfdist!

7 thoughts on “Loading data with External Tables and gpfdist

    1. Jon Post author

      Outsourcer is a tool I wrote that allows you to load data with external tables from a remote SQL Server or Oracle database. There are 3rd party programs that do it too like Attunity.

      Reply
  1. Zuendi

    I get the idea of creating the external table. But how do you fill it with data?
    And how does this file look like? Can you post an example?

    Thx

    Reply
    1. Jon Post author

      A simple example could be a pipe delimited file named example.txt like this on the Master host:

      1|jon|roberts
      2|john|smith
      

      Next, create an external table like this:

      CREATE EXTERNAL TABLE example
      (id int, fname text, lname text)
      LOCATION ('gpfdist://bigmac:8080/example.txt') FORMAT 'text' (DELIMITER '|');
      

      Start a gpfdist process in the same directory as example.txt like this:

      gpfdist -p 8080
      

      Lastly, SELECT from the External Table like this:

      SELECT * FROM example;
      
      Reply
    1. Jon Post author

      You don’t load “real time” data to Greenplum. Greenplum is an MPP database that specializes in analytics and business intelligence. You can micro-batch data but that is typically limited to insert-only log data. For example, you may load stock transactions every 5 minutes or sensor data every 5 minutes. Most people load data to Greenplum on a daily basis if not weekly or even monthly.

      A lot of customers will keep 1 or 2 years worth of data in their operational system and then keep much more history in Greenplum. This makes refreshing the current data in Greenplum rather trivial because it can be done so quickly. It also keeps your OLTP small and nimble.

      Reply

Leave a Reply

Your email address will not be published. Required fields are marked *