Straight from the Greenplum Admin Guide:
gpfdist is Greenplum’s parallel file distribution program. It is used by readable external tables and gpload to serve external table files to all Greenplum Database segments in parallel. It is used by writable external tables to accept output streams from Greenplum Database segments in parallel and write them out to a file.
In this post, I’m going to demonstrate how to use a readable external table to load data that uses gpfdist. This makes it easy for Greenplum users to load massive amounts of data by just writing SQL!
First, start gpfdist as a background process.
gpfdist -d /Users/gpadmin -p 8080 &
Create my Table in Greenplum.
CREATE TABLE foo (bar_id int not null, bar text, bar_description text) DISTRIBUTED BY (bar_id);
Now create my External Table.
CREATE EXTERNAL TABLE ext_foo (LIKE foo) LOCATION ('gpfdist://localhost:8080/demo/foo.txt') FORMAT 'TEXT' (DELIMITER AS '|' NULL AS 'null');
- I could have spelled out all of the columns but I used a shortcut by using “LIKE foo”
- The location indicates it is using gpfdist and the host is local with port 8080
- My file is located /Users/gpadmin/demo/foo.txt but gpfdist is serving /Users/gpadmin so I need to only specify the demo subdirectory and then the file name
- You can pick TEXT of CSV format but TEXT will execute faster
- I used pipes as my delimiter and spelled out null for my null values
INSERT INTO foo SELECT * FROM ext_foo;
You are done!
- When creating the External Table, you can also specify an error_table (created dynamically if not found) with a reject limit.
LOG ERRORS INTO error_table SEGMENT REJECT LIMIT count ROWS
- If you have a really large file to load, you can split the file into multiple files and then specify multiple locations in the External Table. You can even specify different hosts for each file with multiple gpfdists running at once.
And remember, that inserting data this way pushes the data directly to the segments. It bypasses the MASTER server so it scales much better than other database’s loading utilities. Greenplum can load 10TB/Hour for a full rack DCA when using gpfdist!