Question: I’m unable to login with Outsourcer and I get this error message, “Failed to login. Must use a valid Greenplum database account that is a SuperUser.”
Answer: Outsourcer uses the database, not the operating system, to authenticate users. The default installation of HAWQ and Greenplum doesn’t allow external connections, so you will need to edit the pg_hba.conf file on the MASTER server to allow them. The default installation also doesn’t set a database password, so you’ll need to set one as well.
Log in as gpadmin to the MASTER host and execute:
echo "host all all 0.0.0.0/0 md5" >> /data/master/gpseg-1/pg_hba.conf
gpstop -u
psql -c "alter user gpadmin password 'changeme'"
This will allow you to connect to Outsourcer as gpadmin, or as any other superuser account in the database.
The entry in the pg_hba.conf file says that any user connecting to any database from any IP address will be prompted for their database password, which is sent md5-encrypted.
The problem could also be you are trying to use a database account that is not a superuser. Outsourcer has the ability to create and drop tables as well as execute any command you want on the hosts as gpadmin so Outsourcer restricts the application to superusers only.
Question: Why weren’t all of my Oracle and SQL Server columns transferred?
Answer: Outsourcer is designed for structured data so some data types are excluded. There are other techniques beyond the scope of Outsourcer for handling unstructured data.
Oracle columns excluded:
- LONG RAW
SQL Server columns excluded:
Workaround: If you can convert your data to a supported data type like VARCHAR, do so in a database VIEW. In Outsourcer, create a job that uses the VIEW instead of the table and the column will be transferred.
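As a sketch of that workaround, suppose a hypothetical SQL Server table dbo.orders has an XML column that Outsourcer excludes. A view can present the column as VARCHAR instead (the table, column, and view names here are illustrative, not part of Outsourcer):

```sql
-- Hypothetical SQL Server source table with an unsupported XML column.
-- Expose it through a view that casts the column to a supported type,
-- then point the Outsourcer job at the view instead of the table.
CREATE VIEW dbo.orders_for_outsourcer AS
SELECT order_id,
       order_date,
       CAST(order_details AS VARCHAR(8000)) AS order_details
FROM dbo.orders;
```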
Question: How do I execute a custom transformation step after a table is loaded?
Answer: Outsourcer has a “transform” refresh_type in which you can provide SQL to execute in Greenplum.
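For illustration only (the table and schema names are made up, not part of Outsourcer), the SQL you supply for a "transform" job might look like this, run in Greenplum after the load completes:

```sql
-- Hypothetical post-load transformation: rebuild a reporting summary
-- from the freshly loaded table.
TRUNCATE TABLE report.daily_sales;

INSERT INTO report.daily_sales (sale_date, total_amount)
SELECT sale_date, SUM(amount)
FROM stage.sales
GROUP BY sale_date;
```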
Question: How do I synchronize a change in my source system with Greenplum?
Answer: If you add or drop a column in your source system and now jobs are failing in the queue, the simple fix is to drop the table in Greenplum and re-run the job. Outsourcer will not find the table so it will recreate it for you.
For ‘replication’ jobs, you can set ‘snapshot’ = true, which recreates the target table in Greenplum. It also recreates the triggers in the source, the change data capture table in the source, the archive table in Greenplum, and the stage table in Greenplum. After all of the objects are recreated, Outsourcer repopulates the entire table in Greenplum. This is similar to the process used by other database tools that replicate data.
Question: How do I partition a table created by Outsourcer?
Answer: A table can only be partitioned at creation time using the CREATE TABLE command. One strategy is to follow these steps:
- Initially create the table in Greenplum with Outsourcer.
- Rename the table in Greenplum to _old.
- Create a new table in Greenplum with the old table name but now partitioned.
- Insert the data from _old to the new partitioned table.
- Analyze your table.
Outsourcer will see that the partitioned table exists and not try to recreate it so future job executions in the queue will load data into your partitioned table.
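Those steps can be sketched in Greenplum SQL. All object and column names here are hypothetical; adjust the distribution key and partition specification to your data:

```sql
-- Step 2: rename the table Outsourcer created.
ALTER TABLE sales RENAME TO sales_old;

-- Step 3: recreate the table with the same columns, now partitioned.
CREATE TABLE sales (LIKE sales_old)
DISTRIBUTED BY (id)
PARTITION BY RANGE (sale_date)
( START (date '2013-01-01') INCLUSIVE
  END   (date '2014-01-01') EXCLUSIVE
  EVERY (INTERVAL '1 month') );

-- Step 4: copy the data into the partitioned table.
INSERT INTO sales SELECT * FROM sales_old;

-- Step 5: gather statistics.
ANALYZE sales;

-- Once the copy is verified, the old table can be dropped.
DROP TABLE sales_old;
```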
Question: How do I change the distribution?
Answer: Outsourcer distributes tables based on the primary key in the source. If a primary key is not found, then the table is distributed randomly. If random distribution or the primary key column as the distribution is not optimal, you can easily execute a DDL statement to change the distribution of the table.
ALTER TABLE t SET DISTRIBUTED BY (id);
Question: How fast is the load?
Answer: It depends mostly on the performance of the source system.
From a recent POC using Outsourcer:
- 80 jobs completed in 35 minutes
- Largest three tables: 126 million, 80 million, and 44 million
- 10 tables with over a million records
- 365 million rows inserted
- ¼ rack DCA
- An entire SQL Server database was transferred
The bottleneck is typically your source system so these times could easily be better if the source SQL Server database was faster.
Question: Will Outsourcer cause blocking locks in my source system?
Answer: For SQL Server, I execute the query as READ COMMITTED when the database is in “READ COMMITTED SNAPSHOT ISOLATION” mode. This feature in SQL Server 2005 and newer prevents readers from blocking writers. The database uses TempDB to store historical versions of the data to create a read consistent view of the data without blocking locks.
If your SQL Server database does not have this feature turned on, then I use a dirty read (READ UNCOMMITTED). This isn’t the ideal solution but it prevents Outsourcer from blocking changes in your database. I highly recommend using READ COMMITTED SNAPSHOT ISOLATION for read consistency without dirty reads.
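If you want to enable that mode, the standard SQL Server command is shown below (the database name is a placeholder; the WITH ROLLBACK IMMEDIATE clause kills open transactions so the setting can take effect, so run it during a maintenance window):

```sql
-- Enable READ COMMITTED SNAPSHOT ISOLATION on a hypothetical database.
ALTER DATABASE MySourceDb
SET READ_COMMITTED_SNAPSHOT ON
WITH ROLLBACK IMMEDIATE;
```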
For Oracle, I rely on the default behavior of the database for read consistency. The only real problem that can occur from this is the Oracle error “ORA-01555: snapshot too old”. This happens when there are lots of changes to the database while Outsourcer is executing and you don’t have enough Oracle rollback/undo space configured.