The new release focuses on compatibility.
– Added support for Greenplum Database 5.0
– Removed support for Greenplum Database 4.2
– Removed support for HAWQ 2.0.0 and earlier
Outsourcer 5.1.5 has the following changes:
– Fixed os.queue column_name being incorrectly set to a value after a refresh job completes. The value should have been left blank.
– Installer script was enhanced to create tables with hash distribution for Greenplum and HAWQ 1.3 and with random distribution for HAWQ 2.0.
5.1.1 enhances Append jobs to use Big Integer in addition to Integer data types. Additionally, you can now use Timestamp data types.
Be sure to always use an ordered sequence in Oracle and an ordered identity in SQL Server when using an Append job. Timestamp is useful when you are using the system timestamp in Oracle or SQL Server to append new data.
5.0.9 adds support for HAWQ 2.0.
I’m looking for feedback on how best to handle table distribution for tables created in HAWQ 2.0. Outsourcer automatically sets distribution keys based on the source primary keys so it always uses hash when there is a PK found. HAWQ 2.0 supports hash and random distribution as before but random distribution allows a cluster to be resized without having to redistribute the data.
– Should I keep the code as-is?
– Should I set a global environment variable to allow you to set all tables to be created random or not?
– Should I update nearly every UI screen as well as the job and queue tables to have a random boolean that is only used for HAWQ 2.0?
5.0.8 enhances table storage for HAWQ tables. Outsourcer no longer creates “column” oriented tables in HAWQ and instead uses “Parquet”. Additionally, when HAWQ tables use both Compression and Parquet, the compression algorithm is now Snappy rather than Quicklz. Compressed row oriented tables still use Quicklz.
I have updated GPLink to handle timestamp and date formats problems. This has only been a problem with Oracle’s JDBC driver but it may be needed with other databases.
I created a new project that simplifies the process to create Greenplum or Hawq External Tables using gpfdist to stream data from any valid JDBC source. It is like pointing gpfdist at Sqoop to pull data without landing a file but gplink ensures that the data is cleansed first so that the data will be readable by Greenplum or Hawq.
This will work with PostgreSQL (yeah!), MySQL (yuck), DB2, Informix, etc. You will have to download all third party JDBC drivers separately.
This is a new project so I’m looking forward to any feedback you can provide.
1. Error messages raised from Oracle and SQL Server could include special characters which would cause the thread in Outsourcer to die and the status remain labeled as “processing”. Error messages no longer have escape and quote characters to prevent this problem.
2. On new installations, the permissions for the Oracle and SQL Server Jar files would remain owned by root. This has now been corrected to be the admin user (gpadmin).
3. On the “Sources” screen, the label for Append-Only now is correct for Greenplum Database 4.3 and reflects that these are really “Append-Optimized” tables.
1. Added External Tables to enable starting and stopping the UI rather than having to ssh to the Master host.
2. “Queue” screen now provides the ability to cancel processing jobs.
3. Stopping Outsourcer Queue Daemon (Environment Screen), now will cancel currently running Jobs in the Queue.
4. Starting Outsourcer Queue Daemon (Environment Screen), now will make sure orphaned jobs in the Queue will be marked as Failed and if any are currently running, it will cancel these jobs too.
Outsourcer 4.0.1 is mainly bug fixes and enhancements for Oracle.
Changed the default Oracle Fetch Size from 40,000 to 2000. 2000 seems to be the point of diminishing returns for any values greater. The Oracle default is only 10, which makes exporting data very slow. The higher the value, the more memory is needed and the data exporting goes faster. But a fetch size of more than 2000 doesn’t improve performance but it does consume more memory.
Separated the Greenplum JDBC driver into a separate JAR file so that it is possible to upgrade just this driver.
Separated the User Interface classes into a separate JAR file so it needs less memory.
Separated the Scheduler classes into a separate JAR file so it needs less memory.
Separated the NanoHttpd classes into a separate JAR file so it can be upgraded independently of Outsourcer.
Fixed a bug in the UI where the SQL Text Box wasn’t visible when creating a Transform Job.
Fixed a bug in the UI where quotes weren’t escaped properly for SQL entered into the SQL Text Box.
PivotalGuru.com is pleased to announce Outsourcer 4.0!
Check out the new Administrator Guide for more information.
Link to Outsourcer 4.0.
Outsourcer 3.1 is now available for download on the Installation page. Updated Documentation is also available on the Documentation page.
- FetchSize is now configurable. To minimize round trips to Oracle, make this setting rather large (greater than 10,000) which increases exporting speed but at the expense of needing more RAM. Adjust this setting up or down based on your environment. Default is 40,000.
- Fix to better handle non-default schemas
DDL Refresh Type
- Several customers have requested for Outsourcer to just create the tables in Greenplum based on the DDL in the source without loading the tables. This new Refresh Type does just that.
New Name: PivotalGuru.com
Greenplum is now part of Pivotal and to follow this name change, I now have the PivotalGuru.com site!
Pivotal is an exciting and rather large startup company with investments from EMC, VMWare, and GE. I am proud to be part of this new company and very excited about the new offerings in Big Data and Cloud Computing that Pivotal will bring to market.
This is the fifth in a series of “How do I do X in Greenplum database” and today’s topic is How do I prevent “blocking locks” in Greenplum?
Since different databases handle read consistency and locking differently, I will discuss how Oracle, Microsoft SQL Server, PostgreSQL, and finally Greenplum handle locking. This will give you a good way to compare and contrast the different products.
Oracle handles read consistency with using “rollback”. When you update a table, the row is overwritten but first, the old version of the data is put into a rollback segment and the row is marked with a timestamp and a pointer to the old version. This allows users to query the database and not be blocked while an update happens.
If a query started at time1 and and update happens at time2, the query will see a snapshot of the data as of time1. To make this work, the query follows the pointer to the rollback segment to get the old version of the rows that were updated.
There are a couple of problems with this solution. The first is Disk IO because the database has to first read the data file and then read the rollback segment.
The second problem with this is the rollback segment must be sized pretty large in a warehouse environment when you can easily have a query execute for a long period of time. You will eventually run out of rollback and get the “snapshot too old” error on long running queries if you are actively changing data.
SQL Server uses locking by default to provide read consistency. This means an update or even an insert can block other sessions from reading a table. Yuck. Starting with SQL Server 2005, you can enable Read Committed Snapshot Isolation (RCSI) which makes the database behave more like Oracle but it uses TempDB instead of rollback for the old version of the rows updated.
SQL Server Problems
To get around locking, developers will use “read uncommitted” which is a dirty read. I never use a dirty read because it is a hack. Don’t do it.
Using RCSI uses TempDB so more IO is involved. Unfortunately, I also almost never see it used in companies. Most SQL Server shops don’t even know this option is available so they continue with their dirty reads.
PostgreSQL uses Multi-Version Concurrency Control (MVCC) to handle read consistency. In a nutshell, readers don’t block writers and writers don’t block readers. Each transaction sees a snapshot of the database rather than locking tables.
To a user, PostgreSQL behaves similarly to Oracle but it doesn’t use rollback. Instead, an update adds a new row version to the existing table. Because PostgreSQL doesn’t use rollback, you will never get a “snapshot too old” error on long running queries.
The downside to PostgreSQL happens when an update affects a large percentage of a table. If you update an entire table, it effectively doubles the size on disk. To make the old rows available again, a “vacuum” will need to be run so that new updates and/or inserts will use the deleted rows. PostgreSQL uses a background daemon to automatically vacuum tables so this usually isn’t a big deal.
Greenplum is a fork of PostgreSQL 8.2 so it uses MVCC to handle read consistency. Readers don’t block writers and writers don’t block readers. Each transaction sees a snapshot of the database rather than locking tables.
Greenplum MVCC Exception
Greenplum differs from PostgreSQL for update and delete commands. It acquires an Exclusive Lock on a table. However, this lock doesn’t block readers. SQL queries will return data and users will see a snapshot of the data that isn’t dirty. This Exclusive Lock does however block other update and delete commands which is different from PostgreSQL.
Greenplum Database provides read consistency with Multi-Version Concurrency Control (MVCC) similarly to PostgreSQL but update and delete operations are serialized. You don’t need to worry about blocking locks for selecting data because MVCC uses snapshots of the database instead of blocking locks. You can get a blocking lock on update and delete when another update or delete command is executing on the same table but readers aren’t blocked! You will likely never even notice this lock because the database will handle the updates and deletes one at a time while you can still select data.
A customer encouraged me to write a post about how it is possible to use shell scripts to dynamically alter the results from an External Web Table.
For this example, I will create a table in both Greenplum and SQL Server that will contain a list of files. I want to just get the records from the External Web Table that I need dynamically rather than retrieving all of the rows from SQL Server or by recreating the External Web Table. Note: Outsourcer has “Append” jobs for getting just the new records but is has been designed for Integer data types, not character like in this example.
CREATE TABLE myfiles (filename TEXT NOT NULL PRIMARY KEY, dir TEXT NOT NULL) DISTRIBUTED BY (filename); INSERT INTO myfiles VALUES ('2012-01-01_10-30-00_data.txt', '/data/dump'); INSERT INTO myfiles VALUES ('2012-01-02_10-30-00_data.txt', '/data/dump'); INSERT INTO myfiles VALUES ('2012-01-03_10-30-00_data.txt', '/data/dump'); INSERT INTO myfiles VALUES ('2012-01-04_10-30-00_data.txt', '/data/dump'); INSERT INTO myfiles VALUES ('2012-01-05_10-30-00_data.txt', '/data/dump');
SQL Server Table:
CREATE TABLE myfiles (filename VARCHAR(500) NOT NULL PRIMARY KEY, dir VARCHAR(MAX) NOT NULL); INSERT INTO myfiles VALUES ('2012-01-01_10-30-00_data.txt', '/data/dump'); INSERT INTO myfiles VALUES ('2012-01-02_10-30-00_data.txt', '/data/dump'); INSERT INTO myfiles VALUES ('2012-01-03_10-30-00_data.txt', '/data/dump'); INSERT INTO myfiles VALUES ('2012-01-04_10-30-00_data.txt', '/data/dump'); INSERT INTO myfiles VALUES ('2012-01-05_10-30-00_data.txt', '/data/dump'); --Note that these three rows don't exist in Greenplum INSERT INTO myfiles VALUES ('2012-01-06_10-30-00_data.txt', '/data/dump'); INSERT INTO myfiles VALUES ('2012-01-07_10-30-00_data.txt', '/data/dump'); INSERT INTO myfiles VALUES ('2012-01-08_10-30-00_data.txt', '/data/dump');
Shell Script: /Users/gpadmin/get_myfiles.sh
Please note that the use of Outsourcer.jar requires creating a connection in os.ext_connection. Review the Outsourcer documentation on creating External Tables for more information.
max=`psql -A -t -c "SELECT MAX(filename) FROM myfiles"` java -classpath /usr/local/os/jar/Outsourcer.jar:/usr/local/os/jar/sqljdbc4.jar:/usr/local/os/jar/ojdbc6.jar -Xms128m -Xmx256m ExternalData gpdb 5432 1 "SELECT filename, dir FROM poc.dbo.myfiles WHERE filename > '$max'"
Dynamic External Web Table in Greenplum
CREATE EXTERNAL WEB TABLE ext_myfiles (filename TEXT, dir TEXT) EXECUTE E'/Users/gpadmin/get_myfiles.sh' ON MASTER FORMAT 'text' (delimiter '|' null 'null' escape E'\\') ENCODING 'UTF8';
Insert statement in Greenplum
INSERT INTO myfiles SELECT * FROM ext_myfiles;
This technique will only retrieve the rows from SQL Server that need to be inserted. When dealing with millions of records, this will greatly improve performance.
You can manipulate your script to retrieve data based on anything you want all without having to recreate your External Web Table.
Merge is a feature in most databases but I have never liked it much. My source table always had duplicates in it which prevented me from using it.
Here is an example of such with Microsoft SQL Server (BTW, Oracle behaves the same way):
CREATE TABLE stage ( id int identity not null, provider_id int NOT NULL, provider text NOT NULL ); insert into stage (provider_id, provider) values (1, 'bubba'); insert into stage (provider_id, provider) values (2, 'gump'); insert into stage (provider_id, provider) values (3, 'foo'); insert into stage (provider_id, provider) values (4, 'bar'); insert into stage (provider_id, provider) values (4, 'jon'); CREATE TABLE target ( provider_id int NOT NULL primary key, provider text NOT NULL );
My stage table is where new records get loaded from a file and then my target table is what I’m trying to keep up to date. So using a MERGE statement, I should be able to insert the new records and if they exist already, update it.
merge target using (select * from stage) as stage on (target.provider_id = stage.provider_id) when matched then update set provider = stage.provider when not matched then insert (provider_id, provider) values (stage.provider_id, stage.provider);
That is the syntax for MERGE in SQL Server and when executed, I get this error message:
Msg 2627, Level 14, State 1, Line 1 Violation of PRIMARY KEY constraint 'PK__target__00E213107132C993'. Cannot insert duplicate key in object 'dbo.target'. The statement has been terminated.
Why? Well my stage table has two records with the same provider_id. When this happens, I don’t want to error but instead, I want the last value in the file to be used. Unfortunately, MERGE doesn’t handle this. Instead, you have to write code.
In Greenplum, here is how you do a “smart” MERGE that handles real world data where the target has duplicates:
CREATE TABLE stage ( id serial not null, provider_id int NOT NULL, provider text NOT NULL ) DISTRIBUTED BY (provider_id); insert into stage (provider_id, provider) values (1, 'bubba'); insert into stage (provider_id, provider) values (2, 'gump'); insert into stage (provider_id, provider) values (3, 'foo'); insert into stage (provider_id, provider) values (4, 'bar'); insert into stage (provider_id, provider) values (4, 'jon'); CREATE TABLE target ( provider_id int NOT NULL primary key, provider text NOT NULL ) DISTRIBUTED BY (provider_id);
Now create the function:
create or replace function fn_merge_target() returns void as $$ declare v_function text := 'fn_merge_target'; v_location int; begin v_location := 1000; --delete the old records delete from target t using stage s where t.provider_id = s.provider_id and t.provider <> s.provider; v_location := 2000; --inserts insert into target (provider_id, provider) select sub.provider_id, sub.provider from ( select s.*, rank() over (partition by s.provider_id order by s.id desc) as rank from stage s left join target t on s.provider_id = t.provider_id where t.provider is null ) as sub where sub.rank = 1; exception when others then raise exception '(%:%:%)', v_function, v_location, sqlerrm; end; $$ language plpgsql;
Execute the function:
And now query target:
select * from target order by provider_id;
I have 4 rows and for provider_id 4, I have the last version from the stage table rather than failing the command.
In summary, MERGE sounds nice but I’ve never been able to use it. My source or stage table always has duplicates and instead of failing, I want to apply the last version of each primary key record. I demonstrated an easy to use way to provide a much smarter MERGE that is optimized for Greenplum and uses one of the built in analytical functions (rank()).
This is a very good question. When using Outsourcer, you may find a job run longer or shorter from time to time and in my experience, it is because the source database is waiting on something. It is the bottleneck 99% of the time. So why is it slow today versus yesterday?
Databases have wait stats and SQL Server has this information too. You can query SQL Server to find what it is waiting on by logging into SQL Server and executing a SQL command. With the code provided below, you can leverage Outsourcer to query SQL Server from Greenplum!
First create a connection for Outsourcer:
INSERT INTO os.ext_connection (type, server_name, instance_name, port, database_name, user_name, pass) VALUES ('sqlserver', 'jonnywin', null, null, null, 'os_test', 'os_password'); --get the ID you just inserted SELECT id FROM os.ext_connection WHERE type = 'sqlserver' AND server_name = 'jonnywin' --in my example, the value is 1.
Next create the schema in Greenplum for this table:
CREATE SCHEMA sqlserver;
This will create the External Table in Greenplum for Outsourcer:
SELECT os.fn_create_ext_table('sqlserver.sessions', ARRAY['sql_time timestamp','start_time timestamp','status varchar(30)', 'session_id smallint','sqltext text'], 1, 'SELECT getdate() as sql_time, req.start_time, req.status, req.session_id, sqltext.TEXT as sqltext FROM sys.dm_exec_requests req CROSS APPLY sys.dm_exec_sql_text(sql_handle) AS sqltext order by req.start_time');
Now you can execute the query to get the session information:
select * from sqlserver.sessions;
I have found that the most critical thing to look for from SQL Server is the “status”. You will see it “Running” when it is actually doing something but “Suspended” when it is waiting on a resource.
So if you see queries generated by Outsourcer “Suspended” in SQL Server, that is why it is slower today versus yesterday. Work with your SQL Server DBA to determine what resource(s) SQL Server is waiting on next.
More information on External Tables is in the Outsourcer documentation.