Category Archives: Outsourcer

Outsourcer 5.2.0 and Pivotal HDB 2.0.1

Pivotal HDB 2.0.1 (based on Apache HAWQ)
The newest release of Pivotal HDB is fantastic! It adds new features and resolves some issues as well. Here are the release notes.

JDBC Bug
One resolved issue that affected Outsourcer is a JDBC problem documented here: HAWQ-738. Pivotal HDB 2.0.0 was released in May, and I found that Outsourcer suffered from stability issues under load and was generally slower navigating the UI than with HDB 1.3 or Greenplum Database. The issue was quickly resolved that same month, but the fix wasn’t publicly available until October 2016 with this new release.

Quicklz
Quicklz is a compression library that uses GPL licensing. It is bundled with Greenplum Database and with Pivotal HDB 1.3 and 2.0.0. Starting with Pivotal HDB 2.0.1, Quicklz has been removed. Why? Because of that GPL license: HAWQ is an Apache project, and Apache projects can’t bundle GPL-licensed code.

In HDB 1.3 and 2.0.0, the guidance was to use Quicklz compression for row-oriented tables, but starting with 2.0.1 you should use Snappy compression. It is an easy change too:

CREATE TABLE mytable (id INT, fname TEXT) 
WITH (APPENDONLY=TRUE, COMPRESSTYPE=SNAPPY) 
DISTRIBUTED RANDOMLY;

Note: When upgrading to 2.0.1, be sure to first change your Quicklz-compressed tables to Zlib, to Parquet with Snappy, or to no compression at all. Then upgrade, and afterward you can change back to row orientation with Snappy compression. More details are in this section of the release notes.
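
Here is a hedged sketch of rebuilding a hypothetical Quicklz table as Zlib before the upgrade (the table and column names are made up; check the release notes for the officially supported procedure):

CREATE TABLE mytable_zlib (LIKE mytable)
WITH (APPENDONLY=TRUE, COMPRESSTYPE=ZLIB, COMPRESSLEVEL=5)
DISTRIBUTED RANDOMLY;

INSERT INTO mytable_zlib SELECT * FROM mytable;

DROP TABLE mytable;
ALTER TABLE mytable_zlib RENAME TO mytable;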

Future Releases
Future releases of Pivotal HDB should come more quickly now that the licensing hurdle has been cleared. This means developers can focus on enhancements and fixes rather than licensing.

Outsourcer 5.2.0
This new release officially supports Pivotal HDB 2.0.1. It changes the compression for row-oriented tables to Snappy instead of Quicklz. If you are using Pivotal HDB 2.0.0 or Greenplum Database, Outsourcer will still use Quicklz compression.

Please consider upgrading from Pivotal HDB 2.0.0 to 2.0.1, especially if you are using Outsourcer. When I test Outsourcer with Pivotal HDB 2.0, I use build 22425 rather than build 22126, which is what you can download from Pivotal’s site. 22126 has the JDBC bug while 22425 and newer builds do not. And when you upgrade to 2.0.1, also upgrade to Outsourcer 5.2.0.

Download 5.2.0!
Documentation
Source Code

HAWQ 2.0 Generally Available and Outsourcer 5.1.4

HAWQ, known commercially as Pivotal HDB, just had a major release that I’m really excited about.

Major Features
– Based on Apache HAWQ, with added support for Quicklz table compression plus PL/R, PL/Java, and pgCrypto
– Elastic runtime which means more segments (resources) can be allocated automatically based on the complexity of the query
– YARN integration
– Dynamic sizing of the cluster
– Block level storage which enables maximum parallelism
– Single HDFS directory per table which makes it easier to share and manage data
– Fault tolerance enhancements make it easier and quicker to add or remove data nodes
– HDFS catalog caching
– HCatalog integration which greatly simplifies accessing Hive data
– New management interface with “hawq” commands
– Support for Ambari 2.2.2
– Plugin support for Kerberos
– Better logging for runaway query termination

Product Page
Documentation
Download

Outsourcer 5.1.4
I also have updated Outsourcer to take advantage of HAWQ 2.0/Pivotal HDB 2.0. In HAWQ 2.0/Pivotal HDB 2.0, tables should be distributed randomly in order to take advantage of many of the new features. Starting with version 5.1.4, Outsourcer will now make all tables distributed randomly when the database is HAWQ 2.0/Pivotal HDB 2.0. For Greenplum and HAWQ 1.3, the tables will still be distributed by the source’s primary key if one is found.
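
To illustrate the difference, here is a rough sketch of the two distribution styles (the table is made up and this isn’t the exact DDL Outsourcer emits):

-- HAWQ 2.0 / Pivotal HDB 2.0: random distribution
CREATE TABLE mytable (id INT, fname TEXT)
DISTRIBUTED RANDOMLY;

-- Greenplum and HAWQ 1.3: distribute by the source's primary key when one is found
CREATE TABLE mytable (id INT, fname TEXT)
DISTRIBUTED BY (id);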

Documentation
Download 5.1.4
Source Code

Outsourcer 5.1.1

5.1.1 enhances Append jobs to use Big Integer in addition to Integer data types. Additionally, you can now use Timestamp data types.

Be sure to always use an ordered sequence in Oracle and an ordered identity in SQL Server when using an Append job. Timestamp is useful when you are using the system timestamp in Oracle or SQL Server to append new data.
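
For illustration, here is roughly what that looks like in each source database (object names are made up):

-- Oracle: an ordered sequence
CREATE SEQUENCE mytable_seq START WITH 1 INCREMENT BY 1 ORDER;

-- SQL Server: an ordered identity column plus a system timestamp
CREATE TABLE mytable (
id BIGINT IDENTITY(1,1) NOT NULL PRIMARY KEY,
updated_at DATETIME NOT NULL DEFAULT GETDATE()
);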

Download
Documentation
Source Code

Outsourcer 5.0.9

5.0.9 adds support for HAWQ 2.0.

I’m looking for feedback on how best to handle table distribution for tables created in HAWQ 2.0. Outsourcer automatically sets distribution keys based on the source primary keys so it always uses hash when there is a PK found. HAWQ 2.0 supports hash and random distribution as before but random distribution allows a cluster to be resized without having to redistribute the data.

– Should I keep the code as-is?
– Should I set a global environment variable to allow you to set all tables to be created random or not?
– Should I update nearly every UI screen as well as the job and queue tables to have a random boolean that is only used for HAWQ 2.0?

Download
Documentation
Source Code

Outsourcer 5.0.5

I have made improvements around the starting and stopping of gpfdist processes. Outsourcer now handles the situation where a gpfdist process doesn’t start properly: instead of failing, it will pick the next available port. I also completely separated osstart and uistart. The User Interface does not need to be running in order for Outsourcer to load tables in the Queue or select from Custom SQL Tables.

This version is a release candidate as testing has shown it is very stable.

Download
Documentation
Source Code

Outsourcer 5.0.4

Here is another new release of Outsourcer 5. I will likely have another release by the end of the week to further clean up the code so be on the lookout for that. I will update the documentation for release 5.0.5 as well.

I am making changes to Outsourcer 5 due to the move from External Web Tables to External Tables using gpfdist. In my testing with 4 concurrent jobs, gpfdist seemed to work fine with only one process handling the load. After stress testing, I have learned that gpfdist is not multi-threaded and a single process cannot handle the load of multiple jobs.

I have resolved this issue in 5.0.3 by starting and stopping gpfdist for every job. Upon further stress testing, I identified a potential problem when gpfdist fails to start properly. This would cause multiple jobs to fail in a cascading fashion.

5.0.4 resolves the scenario when gpfdist fails to start. The single job will fail and not affect other jobs.

This release also decouples the User Interface from Outsourcer processing jobs in the Queue. You can run Outsourcer without the User Interface running at all.

I also identified a scenario where installing Outsourcer 5 on the Master caused problems. The installer expected the installation to happen on the Standby-Master or on a dedicated ETL server, so it would create a new .pgpass file and back up the existing one. This would remove the gpmon entry and prevent Command Center from working correctly. 5.0.4 now keeps the gpmon entry, if found, in the .pgpass file. If you have a Standby-Master or an ETL host, install Outsourcer 5 there instead.
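
For reference, each line of .pgpass is hostname:port:database:username:password, so a file with the gpmon entry preserved might look roughly like this (host names and passwords are made up):

mdw:5432:gpperfmon:gpmon:changeme
mdw:5432:*:gpadmin:changeme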

And lastly, if you just want a stable release of Outsourcer that isn’t changing frequently, use Outsourcer 4.1.6 which hasn’t changed since January and uses External Web Tables.

5.0.4
Source Code

Outsourcer 5.0

Outsourcer 5.0 is now available. This version addresses most of the requests that I have received since the release of 4.0. I’m very excited to complete this work!

Enhancements
Outsourcer now uses External Tables with gpfdist rather than External Web Tables. This is a big change and allows you to install Outsourcer on any host that is accessible to the Greenplum Segment Hosts / HAWQ Data Nodes. It also reduces the load on the Master server.

Installation does more checking and automation.

Root is no longer needed for installation.

start_all and stop_all scripts are provided to ease starting and stopping Outsourcer.

A new set of screens for creating your own Custom External Table with a pre-defined SQL Statement.

You can install multiple copies of Outsourcer to load data to multiple databases.

Download
Documentation
Source Code

Outsourcer 4.1.5 released

Bugs Fixed
1. Error messages raised from Oracle and SQL Server could include special characters, which would cause the thread in Outsourcer to die and the status to remain labeled as “processing”. Error messages no longer have escape and quote characters, to prevent this problem.
2. On new installations, the Oracle and SQL Server JAR files would remain owned by root. This has now been corrected to be the admin user (gpadmin).
3. On the “Sources” screen, the label for Append-Only now is correct for Greenplum Database 4.3 and reflects that these are really “Append-Optimized” tables.

Enhancements
1. Added External Tables to enable starting and stopping the UI rather than having to ssh to the Master host.
2. “Queue” screen now provides the ability to cancel processing jobs.
3. Stopping the Outsourcer Queue Daemon (Environment Screen) now cancels currently running Jobs in the Queue.
4. Starting the Outsourcer Queue Daemon (Environment Screen) now makes sure orphaned jobs in the Queue are marked as Failed and, if any are currently running, cancels those jobs too.

Download
Documentation
Github

Outsourcer 4.1.0 Released!

Outsourcer now supports HAWQ! If you aren’t familiar with HAWQ, it is basically Greenplum Database that uses HDFS to store data. This makes Outsourcer even more useful as you can load data into Hadoop from SQL Server and Oracle by clicking through an easy to use web interface.

There are many more new features like the ability to specify Append-Optimized tables, compression, or column orientation. Check it out!
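
For instance, a column-oriented, compressed Append-Optimized table looks something like this (a made-up example):

CREATE TABLE mytable (id INT, fname TEXT)
WITH (APPENDONLY=TRUE, ORIENTATION=COLUMN, COMPRESSTYPE=QUICKLZ)
DISTRIBUTED BY (id);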

Zip File
Documentation
GitHub Source

Install Video:

Outsourcer 4.0.2 Released

Outsourcer 4.0.2 is now available for download.

New to 4.0.2

  • Corrected Open Source license file. It is now using a BSD license and the NanoHTTPD license (web server).
  • Corrected install.sql file that incorrectly had $BODY$ for one function in the declaration.
  • Corrected cookie conflict with Command Center.
  • Reduced the number of database calls when using the Web interface.
  • Removed updates from the os.sessions table for the Web interface.
Links

  • 4.0.2 Zip
  • 4.0.2 Administrator’s Guide
  • 4.0.2 Source Code
Outsourcer 4.0.1

    Outsourcer 4.0.1 is mainly bug fixes and enhancements for Oracle.

    Changed the default Oracle fetch size from 40,000 to 2,000, which seems to be the point of diminishing returns. The Oracle default is only 10, which makes exporting data very slow; the higher the value, the more memory is needed and the faster the export goes, but a fetch size above 2,000 doesn’t improve performance and only consumes more memory.

    Separated the Greenplum JDBC driver into a separate JAR file so that it is possible to upgrade just this driver.

    Separated the User Interface classes into a separate JAR file so it needs less memory.

    Separated the Scheduler classes into a separate JAR file so it needs less memory.

    Separated the NanoHttpd classes into a separate JAR file so it can be upgraded independently of Outsourcer.

    Fixed a bug in the UI where the SQL Text Box wasn’t visible when creating a Transform Job.

    Fixed a bug in the UI where quotes weren’t escaped properly for SQL entered into the SQL Text Box.

    Download

    Documentation

    Outsourcer 4.0

    PivotalGuru.com is pleased to announce Outsourcer 4.0!

    Overview Video

    New Features

  • Web based User Interface
  • Recurring Job Scheduling
  • Transform Jobs execution like other job types
  • Transform SQL added to every job type to execute last
  • Dependent Jobs are now supported
  • Dynamic Environment Variables
  • Check out the new Administrator Guide for more information.

    Link to Outsourcer 4.0.

    Outsourcer 3.1 and New Name!

    Outsourcer 3.1 is now available for download on the Installation page. Updated Documentation is also available on the Documentation page.

    What’s new?
    Oracle

    • FetchSize is now configurable. To minimize round trips to Oracle, make this setting rather large (greater than 10,000) which increases exporting speed but at the expense of needing more RAM. Adjust this setting up or down based on your environment. Default is 40,000.

    SQL Server

    • Fix to better handle non-default schemas

    DDL Refresh Type

    • Several customers have requested that Outsourcer just create the tables in Greenplum based on the DDL in the source, without loading the data. This new Refresh Type does just that.

    New Name: PivotalGuru.com

    Greenplum is now part of Pivotal and to follow this name change, I now have the PivotalGuru.com site!

    Pivotal is an exciting and rather large startup company with investments from EMC, VMware, and GE. I am proud to be part of this new company and very excited about the new offerings in Big Data and Cloud Computing that Pivotal will bring to market.

    Outsourcer 3.0.1

    • Fix for Oracle dates in which the time was being truncated, which started with version 2.7.  With 2.7, I added code to handle two-digit years, which Oracle uses for years between 0 and 100, but I used getDate instead of getTimestamp.
    • Fix for Oracle JDBC driver mismatch which may cause ArrayIndexOutOfBoundsException when using “getString”.
    • Fix for double quotes around object name in function os.fn_create_ext_table.  If you already upgraded to 3.0 and have executed the upgrade_3_0.sql file, please recreate this function with this SQL file.  If you are upgrading from 2.x, be sure to execute the upgrade_3_0.sql file.

    Link to Outsourcer 3.0.1 => os_3_0_1

     

    Interesting Limit to Inserts with Java

    Outsourcer uses Java and in doing so, the INSERT statement uses the Statement Class and the executeUpdate method.  This method can be seen here.

    The executeUpdate method is used for INSERT, UPDATE, or DELETE and returns the number of rows impacted by the statement as an int.  In Java, an “int” has a maximum value of 2,147,483,647.  This means if your SQL statement affects more than 2.1 Billion rows, Java will throw an error.  Java int documentation here.

    A Greenplum customer using Outsourcer recently contacted me about this error message:

    Unable to interpret the update count in command completion tag: INSERT 0 3913383319

    This is the error message you will get from Java when you try to insert more than 2.1 Billion rows.  In this case, the customer inserted 3.9 Billion rows from Oracle using Outsourcer.

    If you hit this limit in Outsourcer, don’t worry.  The rows actually get inserted and all you need to do is ANALYZE the table.  It is an odd limit that can be avoided.  You can use a VIEW in the source to break up the source table into smaller parts or you could use the External Table feature in Outsourcer.  You can also just ignore this error because the data does get inserted.
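
    For instance, a hypothetical Oracle view that exposes only a slice of a very large source table (the table, column, and date range are made up):

    CREATE OR REPLACE VIEW big_table_2013 AS
    SELECT *
    FROM big_table
    WHERE created_date >= DATE '2013-01-01'
    AND created_date < DATE '2014-01-01';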

     

    Outsourcer 3.0

    I took some time around Thanksgiving to work on Outsourcer 3.0. I have used the feedback I have received to enhance the security and performance of the application. I also addressed one of the known issues and found and fixed an unreported bug.

    Links
    Administrator Guide
    os_3_0.zip

    Security

    • The External Tables created by Outsourcer no longer contain the username and password for the source connection.
    • Oracle sources now use ALL_% objects rather than DBA_% objects. This means you don’t need to grant DBA or SELECT on the DBA Views in Oracle to use Outsourcer.
    • TRUST authentication is used to start Outsourcer so no password is stored.

    Enhancements

    • Oracle connections are now faster and use less memory than with 2.x.
    • New External Table feature for defining a SQL statement to execute in Oracle or SQL Server to be the source of an External Table.
    • Better examples of the job types for Oracle and SQL Server

    Fixes

    • When a job fails and under certain conditions, Outsourcer could leave a database connection open to the source. Outsourcer now properly closes the connection on failed jobs.
    • If multiple jobs executed at the same time, it was possible for more than 1 job to attempt to create the target schema at the same time. This no longer is a problem.

    Outsourcer 2.9.3

    I found that for Oracle tables stored in MixedCase, I wasn’t creating the SELECT statement correctly, but with 2.9.3 this is fixed. I somehow forgot to put double quotes around the table and schema names. This only affects Oracle objects stored in mixed case.
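
    To illustrate with a made-up example, a mixed-case Oracle object has to be referenced with double quotes or Oracle folds the identifiers to upper case:

    SELECT * FROM "MySchema"."MyTable";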

    OS_2_9_3.zip

    Outsourcer 2.9.2

    I would like to thank the great people at Virgin Australia for helping me find some opportunities to support more features of the Oracle database. Together we identified that I was incorrectly handling ROWID, which I fixed in 2.9. With that release, I also thought I would improve support for Oracle INTEGER, only to realize that I was already handling it in the optimum way, so I fixed that problem in 2.9.1. Lastly, I now support dollar signs for source instance, database, schema, and table names, which is the 2.9.2 release.

    os_2_9_2.zip

    Here is a link to the updated Administrator’s Guide which removes the known issue about a dollar sign in the password.

    Administrator’s Guide

    Outsourcer 2.9.1

    I made a mistake in converting Oracle’s INT to Greenplum’s INT or BIGINT. Oracle doesn’t really have an INT data type and instead has a synonym for NUMBER. To complicate matters, the Oracle INT doesn’t conform to the limits of INT or BIGINT, so these data types must be converted to NUMERIC in Greenplum, which doesn’t have those limits.
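
    As a made-up example of the mapping:

    -- Oracle source: INT is really just a synonym for NUMBER
    CREATE TABLE mytable (id INT);

    -- Greenplum target: NUMERIC instead of INT or BIGINT
    CREATE TABLE mytable (id NUMERIC);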

    Outsourcer 2.9.1

    Outsourcer 2.9

    Outsourcer 2.9 contains enhancements for Oracle connections.

    First, Outsourcer now supports ROWID and UROWID properly. In both cases, these Oracle data types are converted to VARCHAR in Greenplum.

    Secondly, Oracle internally stores an Integer as a Numeric which doesn’t perform in Greenplum as a true Integer. Outsourcer now correctly identifies an Integer in Oracle.

    The last enhancement is in the JDBC connection. Instead of using the SID, Outsourcer now uses the SERVICE NAME which supports Oracle Connection Manager.

    Link to new download => OS_2_9.zip

    Link to new documentation => Outsourcer_Administrator_Guide_2_9

    Outsourcer 2.7

    I have a new version which is only a small bug fix for Oracle dates. If you are storing dates in Oracle less than year 100, then you will need this new version. If you aren’t, then you don’t need to upgrade.

    Details:
    Oracle has the ability to store dates from January 1, 4712 BCE through December 31, 9999 CE, and I convert the dates to strings to load them. I was relying on the locale to do this conversion, but Java apparently converts the date 0099-05-01 to 99-05-01. This doesn’t follow the ISO date standard, so it fails to load this value.

    My fix is to simply format Oracle DATE columns to a four digit year.
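
    Roughly speaking, the generated extract now formats DATE columns explicitly, something along these lines (a sketch with a made-up table, not the exact SQL Outsourcer emits):

    SELECT TO_CHAR(mydate, 'YYYY-MM-DD HH24:MI:SS') AS mydate FROM mytable;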

    Link:
    Outsourcer 2.7

    Reference:
    Oracle Documentation

    Dynamic External Tables

    A customer encouraged me to write a post about how it is possible to use shell scripts to dynamically alter the results from an External Web Table.

    For this example, I will create a table in both Greenplum and SQL Server that will contain a list of files.  I want to dynamically get just the records from the External Web Table that I need, rather than retrieving all of the rows from SQL Server or recreating the External Web Table. Note: Outsourcer has “Append” jobs for getting just the new records, but it has been designed for Integer data types, not character data like in this example.

    Greenplum Table:

    CREATE TABLE myfiles
    (filename TEXT NOT NULL PRIMARY KEY,
    dir TEXT NOT NULL)
    DISTRIBUTED BY (filename);
    
    INSERT INTO myfiles VALUES ('2012-01-01_10-30-00_data.txt', '/data/dump');
    INSERT INTO myfiles VALUES ('2012-01-02_10-30-00_data.txt', '/data/dump');
    INSERT INTO myfiles VALUES ('2012-01-03_10-30-00_data.txt', '/data/dump');
    INSERT INTO myfiles VALUES ('2012-01-04_10-30-00_data.txt', '/data/dump');
    INSERT INTO myfiles VALUES ('2012-01-05_10-30-00_data.txt', '/data/dump');

    SQL Server Table:

    CREATE TABLE myfiles
    (filename VARCHAR(500) NOT NULL PRIMARY KEY,
     dir VARCHAR(MAX) NOT NULL);
    
    INSERT INTO myfiles VALUES ('2012-01-01_10-30-00_data.txt', '/data/dump');
    INSERT INTO myfiles VALUES ('2012-01-02_10-30-00_data.txt', '/data/dump');
    INSERT INTO myfiles VALUES ('2012-01-03_10-30-00_data.txt', '/data/dump');
    INSERT INTO myfiles VALUES ('2012-01-04_10-30-00_data.txt', '/data/dump');
    INSERT INTO myfiles VALUES ('2012-01-05_10-30-00_data.txt', '/data/dump');
    --Note that these three rows don't exist in Greenplum
    INSERT INTO myfiles VALUES ('2012-01-06_10-30-00_data.txt', '/data/dump');
    INSERT INTO myfiles VALUES ('2012-01-07_10-30-00_data.txt', '/data/dump');
    INSERT INTO myfiles VALUES ('2012-01-08_10-30-00_data.txt', '/data/dump');

    Shell Script: /Users/gpadmin/get_myfiles.sh
    Please note that the use of Outsourcer.jar requires creating a connection in os.ext_connection. Review the Outsourcer documentation on creating External Tables for more information.

    max=`psql -A -t -c "SELECT MAX(filename) FROM myfiles"`
    java -classpath /usr/local/os/jar/Outsourcer.jar:/usr/local/os/jar/sqljdbc4.jar:/usr/local/os/jar/ojdbc6.jar -Xms128m -Xmx256m ExternalData gpdb 5432 1 "SELECT filename, dir FROM poc.dbo.myfiles WHERE filename > '$max'"

    Dynamic External Web Table in Greenplum

    CREATE EXTERNAL WEB TABLE ext_myfiles
    (filename TEXT,
     dir TEXT)
    EXECUTE E'/Users/gpadmin/get_myfiles.sh' ON MASTER
    FORMAT 'text' (delimiter '|' null 'null' escape E'\\')
    ENCODING 'UTF8';

    Insert statement in Greenplum

    INSERT INTO myfiles SELECT * FROM ext_myfiles;

    This technique will only retrieve the rows from SQL Server that need to be inserted. When dealing with millions of records, this will greatly improve performance.

    You can manipulate your script to retrieve data based on anything you want all without having to recreate your External Web Table.

    Archiving Data

    A customer recently had a pretty common question about archiving old data in Greenplum.  They wanted to purge old data in their OLTP system so that it remains small and fast.  Historical data that isn’t changing, stays in Greenplum.

    An easy way to do this is with a view and two tables.  The view would UNION ALL the data between the two tables.  Unfortunately, Greenplum treats UNION ALL like most other databases and executes each query in the UNION ALL serially.  The optimal way to handle this is to partition the data.

    So let’s say the customer has a schema called edw and a table called auths.

    create schema edw;
    
    CREATE TABLE edw.auths
    (id int not null,
    fname varchar,
    lname varchar,
    episodedate timestamp)
    DISTRIBUTED BY (id);

    Insert some sample data.

    INSERT INTO edw.auths VALUES (1, 'foo', 'bar', '2001-05-01'::timestamp);
    INSERT INTO edw.auths VALUES (2, 'foo', 'bar', '2012-12-01'::timestamp);
    INSERT INTO edw.auths VALUES (3, 'too', 'cool', '2010-01-01'::timestamp);
    INSERT INTO edw.auths VALUES (4, 'too', 'cool', '2010-01-02'::timestamp);
    INSERT INTO edw.auths VALUES (5, 'old', 'data', '2004-01-02'::timestamp);

    Next, rename this schema.

    ALTER SCHEMA edw RENAME TO edw_archive;

    Create the new edw schema.

    CREATE SCHEMA edw;

    In this case, anything older than 2011-01-01 will be archived and anything newer will be in the current OLTP system.

    CREATE TABLE edw.auths
    (LIKE edw_archive.auths)
    PARTITION BY RANGE (episodedate)
    (START ('2000-01-01'::timestamp) INCLUSIVE END ('2010-01-01'::timestamp) EXCLUSIVE WITH (appendonly=true, orientation=column, compresstype=zlib, compresslevel=5),
    DEFAULT PARTITION auths_current);

    I’m using a pretty high level of compression and column orientation for the archive data.  This will help performance and use less disk space for the archived data.  The default partition has the current data in it and it remains a row store that isn’t compressed.   This is a good example of Polymorphic storage where partitions within the same table are stored in different ways.

    Insert the old data into the new table.

    INSERT INTO edw.auths SELECT * FROM edw_archive.auths;

    Just to show that everything works.

    SELECT * FROM edw.auths;

    With Outsourcer, you will want to change the job that loads edw.auths so it inserts data into the default partition and leaves the archive partition untouched.  To do this, get the names of the underlying inherited tables.

    SELECT * FROM pg_partitions WHERE schemaname = 'edw' AND tablename = 'auths';

    You can see the two tables independently.

    --new data
    
    SELECT * FROM edw.auths_1_prt_auths_current;
    
    --archive data
    
    SELECT * FROM edw.auths_1_prt_2;

    To change Outsourcer, first remove the old job for edw.auths and insert the new one with the target of edw.auths_1_prt_auths_current.

    DELETE FROM os.job WHERE (target).schema_name = 'edw' AND (target).table_name = 'auths';
    
    INSERT INTO os.job(
    refresh_type, target, source)
    VALUES ('refresh',
    ('edw', 'auths_1_prt_auths_current'),
    ('sqlserver', 'jonnywin', null, null, 'edw', 'dbo', 'auths', 'sa', 'sa'));

    Now when you execute this job, it will truncate the current partition, insert the data from the OLTP system, and then analyze the partition.  The archive partition will be left untouched.  Querying data across the archived and current data will execute much faster than a UNION ALL because it runs in parallel and it is compressed and column oriented.

    Ugly Data

    It seems that everyone has it. Ugly data: data in transactional systems that isn’t so clean, like control characters or rarely used characters that are used programmatically. This ugly data typically errors out during a load process, which requires someone to cleanse the data by writing scripts and then loading it.

    This is time consuming and feels like you are re-inventing the wheel over and over again.

    Outsourcer does the cleansing for you! As the data is written to Greenplum, control characters are removed and special characters are escaped properly.

    Here is a demo of Outsourcer in action loading ugly data.

    Oracle Sessions

    Just like for SQL Server, you may want to know what is going on in Oracle from inside Greenplum.  Here is an External Web Table that uses Outsourcer to give you that information.

    First create a connection for Outsourcer:

    INSERT INTO os.ext_connection
    (type, server_name, instance_name, port, database_name, user_name, pass)
    VALUES
    ('oracle', 'jonnywin', null, 1521, 'xe', 'os_test', 'os_password');
    
    --get the ID you just inserted
    SELECT id 
    FROM os.ext_connection 
    WHERE type = 'oracle' 
    AND server_name = 'jonnywin'
    AND port = 1521
    AND database_name = 'xe';
    
    --in my example, the value is 3.
    

    Next create the schema in Greenplum for this table:

    CREATE SCHEMA oracle;
    

    This will create the External Table in Greenplum for Outsourcer:

    --notice the double escaping of the dollar sign.
    SELECT os.fn_create_ext_table('oracle.sessions', ARRAY[
    'username varchar(30)','osuser varchar(30)','machine varchar(64)',
    'terminal varchar(30)','program varchar(48)',
    'module varchar(48)','sql_text text','logon_time timestamp',
    'service_name varchar(64)'], 
    3,
    'SELECT s.USERNAME, s.OSUSER, s.MACHINE, s.TERMINAL, s.PROGRAM, s.MODULE,
    a.sql_text, s.LOGON_TIME, s.service_name 
    FROM v\\\\$session s, v\\\\$sqlarea a 
    WHERE a.address = s.sql_address');
    

    Now you can execute the query to get the session information:

    select * from oracle.sessions;

    More information on External Tables is in the Outsourcer documentation.

    Why is SQL Server so slow?

    This is a very good question.  When using Outsourcer, you may find a job runs longer or shorter from time to time, and in my experience it is because the source database is waiting on something.  It is the bottleneck 99% of the time.  So why is it slow today versus yesterday?

    Databases have wait stats and SQL Server has this information too.  You can query SQL Server to find what it is waiting on by logging into SQL Server and executing a SQL command.  With the code provided below, you can leverage Outsourcer to query SQL Server from Greenplum!

    First create a connection for Outsourcer:

    INSERT INTO os.ext_connection
    (type, server_name, instance_name, port, database_name, user_name, pass)
    VALUES
    ('sqlserver', 'jonnywin', null, null, null, 'os_test', 'os_password');
    
    --get the ID you just inserted
    SELECT id 
    FROM os.ext_connection 
    WHERE type = 'sqlserver' 
    AND server_name = 'jonnywin'
    
    --in my example, the value is 1.
    

    Next create the schema in Greenplum for this table:

    CREATE SCHEMA sqlserver;
    

    This will create the External Table in Greenplum for Outsourcer:

    SELECT os.fn_create_ext_table('sqlserver.sessions', 
    ARRAY['sql_time timestamp','start_time timestamp','status varchar(30)',
    'session_id smallint','sqltext text'], 
    1,
    'SELECT getdate() as sql_time, req.start_time, req.status, 
    req.session_id, sqltext.TEXT as sqltext 
    FROM sys.dm_exec_requests req 
    CROSS APPLY sys.dm_exec_sql_text(sql_handle) AS sqltext 
    order by req.start_time');
    

    Now you can execute the query to get the session information:

    select * from sqlserver.sessions;

    I have found that the most critical thing to look for from SQL Server is the “status”.  You will see it “Running” when it is actually doing something but “Suspended” when it is waiting on a resource.

    So if you see queries generated by Outsourcer “Suspended” in SQL Server, that is why it is slower today versus yesterday.  Work with your SQL Server DBA to determine what resource(s) SQL Server is waiting on next.
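
    If you want to dig deeper than per-session status, a hedged starting point is SQL Server’s cumulative wait statistics, which you can run directly in SQL Server or wrap in another Outsourcer External Table just like the one above:

    SELECT TOP 20 wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
    FROM sys.dm_os_wait_stats
    ORDER BY wait_time_ms DESC;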

    More information on External Tables is in the Outsourcer documentation.