FAQ

Question: I’m unable to log in with Outsourcer and I get this error message: “Failed to login. Must use a valid Greenplum database account that is a SuperUser.”

Answer: Outsourcer uses the database, not the operating system, to authenticate users. The default installation of HAWQ and Greenplum doesn’t allow external connections, so you will need to edit the pg_hba.conf file on the MASTER server to allow them. The default installation also doesn’t set a database password, so you’ll need to set that as well.

Log in as gpadmin on the MASTER host and execute the following (adjust the data directory path if your master directory differs):

echo "host all all 0.0.0.0/0 md5" >> /data/master/gpseg-1/pg_hba.conf
gpstop -u
psql -c "alter user gpadmin password 'changeme'"

This will allow you to connect as gpadmin, or any other superuser account in the database, to Outsourcer.

The pg_hba.conf entry says that any user connecting to any database from any IP address will be prompted for a database password, which is sent MD5-encrypted.

The problem could also be that you are trying to use a database account that is not a superuser. Outsourcer can create and drop tables as well as execute any command you want on the hosts as gpadmin, so the application is restricted to superusers only.
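
If you aren’t sure whether the account you are using is a superuser, you can check with a quick query against the standard pg_roles catalog:

-- Returns 't' in rolsuper when the current account is a superuser
SELECT rolname, rolsuper FROM pg_roles WHERE rolname = current_user;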

Question: Why weren’t all of my Oracle and SQL Server columns transferred?

Answer:  Outsourcer is designed for structured data so some data types are excluded.  There are other techniques beyond the scope of Outsourcer for handling unstructured data.

Oracle columns excluded:

  • BLOB
  • BFILE
  • RAW
  • LONG RAW
  • UROWID
  • MLSLABEL
  • XMLTYPE

SQL Server columns excluded:

  • binary
  • image
  • timestamp
  • xml
  • varbinary
  • text
  • ntext
  • sql_variant

Workaround:  If you can convert your data to a supported data type like VARCHAR, do the conversion in a database VIEW.  Then, with Outsourcer, create a job that uses the VIEW instead of the table and the column will be transferred.
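
For example, a minimal sketch of such a view in SQL Server, assuming a hypothetical dbo.orders table with an unsupported xml column named doc:

CREATE VIEW dbo.v_orders AS
SELECT id,
       order_date,
       -- Cast the unsupported xml column to a supported character type
       CAST(doc AS nvarchar(max)) AS doc
FROM dbo.orders;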


Question:  How do I execute a custom transformation step after a table is loaded?

Answer:  Outsourcer has a “transform” refresh_type in which you can provide SQL to execute in Greenplum.
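
The SQL you provide is ordinary Greenplum SQL. A minimal sketch, using hypothetical staging and reporting tables:

-- Hypothetical transformation executed after the load completes
TRUNCATE reporting.daily_sales;

INSERT INTO reporting.daily_sales
SELECT sale_date, SUM(amount)
FROM staging.sales
GROUP BY sale_date;

ANALYZE reporting.daily_sales;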


Question:  How do I synchronize a change in my source system with Greenplum?

Answer:  If you add or drop a column in your source system and jobs now fail in the queue, the simple fix is to drop the table in Greenplum and re-run the job.  Outsourcer will not find the table, so it will recreate it for you.

For ‘replication’ jobs, you can set ‘snapshot’ = true, which recreates the target table in Greenplum.  It also recreates the triggers and the change data capture table in the source, plus the archive table and the stage table in Greenplum.  After all of the objects are recreated, Outsourcer repopulates the entire table in Greenplum.  This is similar to the process used by other database tools that replicate data.


Question:  How do I partition a table created by Outsourcer?

Answer:  A table can only be partitioned at creation time with the CREATE TABLE command.  One strategy is to follow these steps (a sketch follows the list):

  1. Initially create the table in Greenplum with Outsourcer.
  2. Rename the table in Greenplum with an _old suffix.
  3. Create a new table in Greenplum with the old table name, but now partitioned.
  4. Insert the data from the _old table into the new partitioned table.
  5. Analyze your table.
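
A minimal sketch of steps 2 through 5, assuming a hypothetical public.sales table with an id primary key and a sale_date column:

ALTER TABLE public.sales RENAME TO sales_old;

-- Recreate the table with the original name, now partitioned by month
CREATE TABLE public.sales (
    id int,
    sale_date date,
    amount numeric
)
DISTRIBUTED BY (id)
PARTITION BY RANGE (sale_date)
(
    START (date '2013-01-01') INCLUSIVE
    END (date '2014-01-01') EXCLUSIVE
    EVERY (INTERVAL '1 month')
);

INSERT INTO public.sales SELECT * FROM public.sales_old;
ANALYZE public.sales;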

Outsourcer will see that the partitioned table exists and not try to recreate it so future job executions in the queue will load data into your partitioned table.


Question:  How do I change the distribution?

Answer:   Outsourcer distributes tables based on the primary key in the source.  If a primary key is not found, the table is distributed randomly.  If random distribution or the primary key as the distribution key is not optimal, you can easily execute a DDL statement to change the distribution of the table, which redistributes the existing rows across the segments.

ALTER TABLE t SET DISTRIBUTED BY (id);
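
To verify the change, you can inspect the gp_distribution_policy catalog table (attrnums lists the column positions used as the distribution key; this is a quick check, not something Outsourcer requires):

SELECT localoid::regclass AS table_name, attrnums
FROM gp_distribution_policy
WHERE localoid = 't'::regclass;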


Question:  How fast is the load?

Answer:  It depends mostly on the performance of the source system.

From a recent POC using Outsourcer:

  • 80 jobs completed in 35 minutes
  • Largest three tables: 126 million, 80 million, and 44 million rows
  • 10 tables with over a million records
  • 365 million rows inserted
  • ¼ rack DCA
  • An entire SQL Server database was transferred

The bottleneck is typically the source system, so these times could easily have been better if the source SQL Server database had been faster.


Question:  Will Outsourcer cause blocking locks in my source system?

Answer:  No.

For SQL Server, I execute the query as READ COMMITTED when the database is in “READ COMMITTED SNAPSHOT ISOLATION” mode.  This feature, available in SQL Server 2005 and newer, prevents readers from blocking writers.  The database uses TempDB to store historical versions of the data to create a read-consistent view without blocking locks.

If your SQL Server database does not have this feature turned on, then I use a dirty read (READ UNCOMMITTED).  This isn’t the ideal solution but it prevents Outsourcer from blocking changes in your database.  I highly recommend using READ COMMITTED SNAPSHOT ISOLATION for read consistency without dirty reads.
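
To check whether your database has this mode enabled, and to turn it on, you can run the following in SQL Server (YourDatabase is a placeholder for your database name):

-- Check whether READ COMMITTED SNAPSHOT ISOLATION is enabled
SELECT name, is_read_committed_snapshot_on
FROM sys.databases
WHERE name = 'YourDatabase';

-- Enable it (the database must have no other active connections)
ALTER DATABASE YourDatabase SET READ_COMMITTED_SNAPSHOT ON;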

For Oracle, I rely on the default behavior of the database for read consistency.  The only real problem that can occur from this is the Oracle error “ORA-01555: snapshot too old”.  This happens if there are lots of changes to the database while Outsourcer is executing and you don’t have enough Oracle rollback (undo) space.
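
If you hit that error, one quick thing to check is the undo retention setting (a minimal check only; undo tuning is beyond the scope of this FAQ):

-- How long Oracle attempts to retain undo data, in seconds
SELECT name, value FROM v$parameter WHERE name = 'undo_retention';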

10 thoughts on “FAQ”

    1. Jon Post author

      Outsourcer leverages features that are unique to Greenplum and are not available in PostgreSQL, which would make the refactor impossible. Namely, the External Web Table feature, which lets a table generate its results by executing a program, is a Greenplum-only feature.
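
      For illustration, a hypothetical example of that syntax (not Outsourcer’s actual tables):

      -- Selecting from this table runs the command and returns its output
      CREATE EXTERNAL WEB TABLE ext_host_info (info text)
      EXECUTE 'uname -a' ON MASTER
      FORMAT 'TEXT';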

    1. Jon Post author

      It isn’t a direct comparison of the load time from flat files to what Outsourcer can do. Outsourcer is doing the Extract, Transfer, and Load. The bottleneck is usually the source database (SQL Server or Oracle) or the network between the source and the target (Greenplum).

      To compare the “load times”, try it. Extract data from SQL Server with something like sqlcmd or bcp, transfer it to a server like smdw, use gpload to load the data, and then review and fix the errors. Then load that same table with Outsourcer. I bet Outsourcer will be faster.

  1. Venkadesh R

    I tried to create a Greenplum function with a default value in the parameters and it is throwing an error. My code is below:

    CREATE OR REPLACE FUNCTION fn_send_email(p_char_subj VARCHAR, p_char_msg VARCHAR, p_char_message_to VARCHAR DEFAULT 'test')
    RETURNS void AS
    $BODY$

    …………

    $BODY$
    LANGUAGE plpgsql VOLATILE;

    Error Msg: ERROR: 42601: syntax error at or near "DEFAULT"

    1. Jon Post author

      Greenplum doesn’t directly support default values for function parameters, but it can be done by overloading the function.

      This is the function that sends the email.

      CREATE OR REPLACE FUNCTION fn_send_email(p_char_subj VARCHAR, p_char_msg VARCHAR, p_char_message_to VARCHAR)
      RETURNS void AS
      $BODY$
      BEGIN
      /* body of function that sends mail */
      END;
      $BODY$
      LANGUAGE plpgsql VOLATILE;
      

      And this function has the same name but doesn’t include the p_char_message_to parameter. It sets the default value of 'test' and then calls the first function.

      
      CREATE OR REPLACE FUNCTION fn_send_email(p_char_subj VARCHAR, p_char_msg VARCHAR)
      RETURNS void AS
      $BODY$
      DECLARE
              v_char_message_to VARCHAR := 'test';
      BEGIN
              perform fn_send_email(p_char_subj, p_char_msg, v_char_message_to);
      
      END;
      $BODY$
      LANGUAGE plpgsql VOLATILE;
      

      Alternatively, you could have one function, pass in NULL, and use a COALESCE on the parameter to set it to 'test' when it comes in NULL.
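
      A sketch of that single-function approach:

      CREATE OR REPLACE FUNCTION fn_send_email(p_char_subj VARCHAR, p_char_msg VARCHAR, p_char_message_to VARCHAR)
      RETURNS void AS
      $BODY$
      DECLARE
              -- Fall back to the default recipient when NULL is passed in
              v_char_message_to VARCHAR := COALESCE(p_char_message_to, 'test');
      BEGIN
              /* body of function that sends mail, using v_char_message_to */
      END;
      $BODY$
      LANGUAGE plpgsql VOLATILE;

      -- Call it with NULL to get the default recipient:
      -- SELECT fn_send_email('subject', 'message', NULL);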

    1. Jon Post author

      Defining a port for SQL Server is pretty rare, but it can be done. Usually a named instance is used instead of a different port, and the named instance handles the port number automatically. But to answer your question, you can add “:portNumber” to the end of the server name (for example, sqlserver01:1433).

  2. Dr David King

    I am looking for a solution to keep Greenplum databases in sync between datacenters. Will Outsourcer allow me to specify Greenplum as the source and the destination in a different datacenter? Is there another way? Perhaps Kafka/NiFi?

    1. Jon Post author

      Outsourcer is not designed to keep two different Greenplum databases in sync. This is typically done with a dual ETL strategy or gptransfer. I’ve seen one installation where gpcrondump was used to back up the database to DataDomain and then it was restored from DataDomain to a different Greenplum cluster.

