The new release focuses on compatibility.
– Added support for Greenplum Database 5.0
– Removed support for Greenplum Database 4.2
– Removed support for HAWQ 2.0.0 and earlier
Outsourcer 5.1.5 has the following changes:
– Fixed os.queue column_name being incorrectly set to a value after a refresh job completes. The value should have been left blank.
– Installer script was enhanced to create tables with hash distribution for Greenplum and HAWQ 1.3 and with random distribution for HAWQ 2.0.
5.1.1 enhances Append jobs to use Big Integer in addition to Integer data types. Additionally, you can now use Timestamp data types.
Be sure to always use an ordered sequence in Oracle and an ordered identity in SQL Server when using an Append job. Timestamp is useful when you are using the system timestamp in Oracle or SQL Server to append new data.
5.0.9 adds support for HAWQ 2.0.
I’m looking for feedback on how best to handle table distribution for tables created in HAWQ 2.0. Outsourcer automatically sets distribution keys based on the source primary keys so it always uses hash when there is a PK found. HAWQ 2.0 supports hash and random distribution as before but random distribution allows a cluster to be resized without having to redistribute the data.
– Should I keep the code as-is?
– Should I set a global environment variable to allow you to set all tables to be created random or not?
– Should I update nearly every UI screen as well as the job and queue tables to have a random boolean that is only used for HAWQ 2.0?
5.0.8 enhances table storage for HAWQ tables. Outsourcer no longer creates “column” oriented tables in HAWQ and instead uses “Parquet”. Additionally, when HAWQ tables use both Compression and Parquet, the compression algorithm is now Snappy rather than Quicklz. Compressed row oriented tables still use Quicklz.
I have updated GPLink to handle timestamp and date formats problems. This has only been a problem with Oracle’s JDBC driver but it may be needed with other databases.
I created a new project that simplifies the process to create Greenplum or Hawq External Tables using gpfdist to stream data from any valid JDBC source. It is like pointing gpfdist at Sqoop to pull data without landing a file but gplink ensures that the data is cleansed first so that the data will be readable by Greenplum or Hawq.
This will work with PostgreSQL (yeah!), MySQL (yuck), DB2, Informix, etc. You will have to download all third party JDBC drivers separately.
This is a new project so I’m looking forward to any feedback you can provide.
1. Error messages raised from Oracle and SQL Server could include special characters which would cause the thread in Outsourcer to die and the status remain labeled as “processing”. Error messages no longer have escape and quote characters to prevent this problem.
2. On new installations, the permissions for the Oracle and SQL Server Jar files would remain owned by root. This has now been corrected to be the admin user (gpadmin).
3. On the “Sources” screen, the label for Append-Only now is correct for Greenplum Database 4.3 and reflects that these are really “Append-Optimized” tables.
1. Added External Tables to enable starting and stopping the UI rather than having to ssh to the Master host.
2. “Queue” screen now provides the ability to cancel processing jobs.
3. Stopping Outsourcer Queue Daemon (Environment Screen), now will cancel currently running Jobs in the Queue.
4. Starting Outsourcer Queue Daemon (Environment Screen), now will make sure orphaned jobs in the Queue will be marked as Failed and if any are currently running, it will cancel these jobs too.
Outsourcer 4.0.1 is mainly bug fixes and enhancements for Oracle.
Changed the default Oracle Fetch Size from 40,000 to 2000. 2000 seems to be the point of diminishing returns for any values greater. The Oracle default is only 10, which makes exporting data very slow. The higher the value, the more memory is needed and the data exporting goes faster. But a fetch size of more than 2000 doesn’t improve performance but it does consume more memory.
Separated the Greenplum JDBC driver into a separate JAR file so that it is possible to upgrade just this driver.
Separated the User Interface classes into a separate JAR file so it needs less memory.
Separated the Scheduler classes into a separate JAR file so it needs less memory.
Separated the NanoHttpd classes into a separate JAR file so it can be upgraded independently of Outsourcer.
Fixed a bug in the UI where the SQL Text Box wasn’t visible when creating a Transform Job.
Fixed a bug in the UI where quotes weren’t escaped properly for SQL entered into the SQL Text Box.
Oracle’s JDBC driver is challenging to say the least! So, I put together a guide on how to work with this data source for Outsourcer.
PivotalGuru.com is pleased to announce Outsourcer 4.0!
Check out the new Administrator Guide for more information.
Link to Outsourcer 4.0.
Outsourcer 3.1 is now available for download on the Installation page. Updated Documentation is also available on the Documentation page.
- FetchSize is now configurable. To minimize round trips to Oracle, make this setting rather large (greater than 10,000) which increases exporting speed but at the expense of needing more RAM. Adjust this setting up or down based on your environment. Default is 40,000.
- Fix to better handle non-default schemas
DDL Refresh Type
- Several customers have requested for Outsourcer to just create the tables in Greenplum based on the DDL in the source without loading the tables. This new Refresh Type does just that.
New Name: PivotalGuru.com
Greenplum is now part of Pivotal and to follow this name change, I now have the PivotalGuru.com site!
Pivotal is an exciting and rather large startup company with investments from EMC, VMWare, and GE. I am proud to be part of this new company and very excited about the new offerings in Big Data and Cloud Computing that Pivotal will bring to market.
- Fix for Oracle dates in which the time was being truncated that started with version 2.7. With 2.7, I added code to handle two digit years which Oracle uses for centuries between 0 and 100 but I used getDate instead of getTimestamp.
- Fix for Oracle JDBC driver mismatch which may cause ArrayIndexOutOfBoundsException when using “getString”.
- Fix for double quotes around object name in function os.fn_create_ext_table. If you already upgraded to 3.0 and have executed the upgrade_3_0.sql file, please recreate this function with this SQL file. If you are upgrading from 2.x, be sure to execute the upgrade_3_0.sql file.
Link to Outsourcer 3.0.1 => os_3_0_1
This is the fifth in a series of “How do I do X in Greenplum database” and today’s topic is How do I prevent “blocking locks” in Greenplum?
Since different databases handle read consistency and locking differently, I will discuss how Oracle, Microsoft SQL Server, PostgreSQL, and finally Greenplum handle locking. This will give you a good way to compare and contrast the different products.
Oracle handles read consistency with using “rollback”. When you update a table, the row is overwritten but first, the old version of the data is put into a rollback segment and the row is marked with a timestamp and a pointer to the old version. This allows users to query the database and not be blocked while an update happens.
If a query started at time1 and and update happens at time2, the query will see a snapshot of the data as of time1. To make this work, the query follows the pointer to the rollback segment to get the old version of the rows that were updated.
There are a couple of problems with this solution. The first is Disk IO because the database has to first read the data file and then read the rollback segment.
The second problem with this is the rollback segment must be sized pretty large in a warehouse environment when you can easily have a query execute for a long period of time. You will eventually run out of rollback and get the “snapshot too old” error on long running queries if you are actively changing data.
SQL Server uses locking by default to provide read consistency. This means an update or even an insert can block other sessions from reading a table. Yuck. Starting with SQL Server 2005, you can enable Read Committed Snapshot Isolation (RCSI) which makes the database behave more like Oracle but it uses TempDB instead of rollback for the old version of the rows updated.
SQL Server Problems
To get around locking, developers will use “read uncommitted” which is a dirty read. I never use a dirty read because it is a hack. Don’t do it.
Using RCSI uses TempDB so more IO is involved. Unfortunately, I also almost never see it used in companies. Most SQL Server shops don’t even know this option is available so they continue with their dirty reads.
PostgreSQL uses Multi-Version Concurrency Control (MVCC) to handle read consistency. In a nutshell, readers don’t block writers and writers don’t block readers. Each transaction sees a snapshot of the database rather than locking tables.
To a user, PostgreSQL behaves similarly to Oracle but it doesn’t use rollback. Instead, an update adds a new row version to the existing table. Because PostgreSQL doesn’t use rollback, you will never get a “snapshot too old” error on long running queries.
The downside to PostgreSQL happens when an update affects a large percentage of a table. If you update an entire table, it effectively doubles the size on disk. To make the old rows available again, a “vacuum” will need to be run so that new updates and/or inserts will use the deleted rows. PostgreSQL uses a background daemon to automatically vacuum tables so this usually isn’t a big deal.
Greenplum is a fork of PostgreSQL 8.2 so it uses MVCC to handle read consistency. Readers don’t block writers and writers don’t block readers. Each transaction sees a snapshot of the database rather than locking tables.
Greenplum MVCC Exception
Greenplum differs from PostgreSQL for update and delete commands. It acquires an Exclusive Lock on a table. However, this lock doesn’t block readers. SQL queries will return data and users will see a snapshot of the data that isn’t dirty. This Exclusive Lock does however block other update and delete commands which is different from PostgreSQL.
Greenplum Database provides read consistency with Multi-Version Concurrency Control (MVCC) similarly to PostgreSQL but update and delete operations are serialized. You don’t need to worry about blocking locks for selecting data because MVCC uses snapshots of the database instead of blocking locks. You can get a blocking lock on update and delete when another update or delete command is executing on the same table but readers aren’t blocked! You will likely never even notice this lock because the database will handle the updates and deletes one at a time while you can still select data.
Just like for SQL Server, you may want to know what is going on in Oracle from inside Greenplum. Here is an External Web Table that uses Outsourcer to give you that information.
First create a connection for Outsourcer:
INSERT INTO os.ext_connection (type, server_name, instance_name, port, database_name, user_name, pass) VALUES ('oracle', 'jonnywin', null, 1521, 'xe', 'os_test', 'os_password'); --get the ID you just inserted SELECT id FROM os.ext_connection WHERE type = 'oracle' AND server_name = 'jonnywin' AND port = 1521 AND database_name = 'xe'; --in my example, the value is 3.
Next create the schema in Greenplum for this table:
CREATE SCHEMA oracle;
This will create the External Table in Greenplum for Outsourcer:
--notice the double escaping of the dollar sign. SELECT os.fn_create_ext_table('oracle.sessions', ARRAY[ 'username varchar(30)','osuser varchar(30)','machine varchar(64)', 'terminal varchar(30)','program varchar(48)', 'module varchar(48)','sql_text text','logon_time timestamp', 'service_name varchar(64)'], 3, 'SELECT s.USERNAME, s.OSUSER, s.MACHINE, s.TERMINAL, s.PROGRAM, s.MODULE, a.sql_text, s.LOGON_TIME, s.service_name FROM v\\\\$session s, v\\\\$sqlarea a WHERE a.address = s.sql_address');
Now you can execute the query to get the session information:
select * from oracle.sessions;
More information on External Tables is in the Outsourcer documentation.