Getting the filename with gpfdist

Occasionally, I see the request to get the filename added to a file read by gpfdist. Here is a way to do it!

First, create a YML file named “transform_config.yml” with the following:

---
VERSION: 1.0.0.1
TRANSFORMATIONS:
  transformation_input:
     TYPE: input 
     CONTENT: data
     COMMAND: /bin/bash transform.sh

Next, create that “transform.sh” file. This is just a simple example that gets all txt files but you can also pass in a parameter to this script. The filename in the external table gets passed to the script.

#!/bin/bash
set -e
for i in $(ls *.txt); do awk '{print FILENAME"|"$0}' $i; done

Create two test files (test_file_1.txt and test_file_2.txt).

cat test_file_1.txt 
1|foo1
2|foo2
3|foo3
4|foo4
5|foo5
6|foo6
7|foo7
8|foo8
9|foo9
10|foo10
cat test_file_2.txt 
11|foo11
12|foo12
13|foo13
14|foo14
15|foo15
16|foo16
17|foo17
18|foo18
19|foo19
20|foo20

Start gpfdist in the background.

gpfdist -p 8999 -c transform_config.yml > mylog 2>&1 < mylog &

Create the External Table but be sure to change the hostname. Note that "foo" in the LOCATION is the filename. I'm ignoring it for this example but this is how you can pass parameters to the script. You add %filename% to the YML file as the parameter to the script.

CREATE EXTERNAL TABLE ext_transform_table 
(filename text, id int, descrption text) 
LOCATION ('gpfdist://gpdbsne:8999/foo#transform=transformation_input') 
FORMAT 'text' (DELIMITER '|')

Now select from the External Table.

select * from ext_transform_table;
    filename     | id | descrption 
-----------------+----+------------
 test_file_1.txt |  1 | foo1
 test_file_1.txt |  2 | foo2
 test_file_1.txt |  3 | foo3
 test_file_1.txt |  4 | foo4
 test_file_1.txt |  5 | foo5
 test_file_1.txt |  6 | foo6
 test_file_1.txt |  7 | foo7
 test_file_1.txt |  8 | foo8
 test_file_1.txt |  9 | foo9
 test_file_1.txt | 10 | foo10
 test_file_2.txt | 11 | foo11
 test_file_2.txt | 12 | foo12
 test_file_2.txt | 13 | foo13
 test_file_2.txt | 14 | foo14
 test_file_2.txt | 15 | foo15
 test_file_2.txt | 16 | foo16
 test_file_2.txt | 17 | foo17
 test_file_2.txt | 18 | foo18
 test_file_2.txt | 19 | foo19
 test_file_2.txt | 20 | foo20
(20 rows)

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.