Monthly Archives: January 2014

When should I use Greenplum Database versus HAWQ?

I get this question a lot so I thought I would post the answer. Greenplum is a robust MPP database that works very well for Data Marts and Enterprise Data Warehouses, tackling historical Business Intelligence reporting as well as predictive analytical use cases. HAWQ provides the most robust SQL interface for Hadoop and can tackle data exploration and transformation in HDFS.

HAWQ
HAWQ is “HAdoop With Query” and is basically a port of Greenplum to store data natively in HDFS. This gives everyone using Hadoop a friendlier and more robust programming interface. Tools for Hadoop have evolved over time out of the need to get more consumers using Hadoop in a more productive manner.

The HAWQ architecture is very similar to Greenplum's right now, with each segment having its own files, but instead of storing those files local to each segment (shared nothing), the data is stored in HDFS. Over time, I believe this architecture will change to enable greater elasticity in the number of segments processing data, but right now, Greenplum and HAWQ share this design.

Hadoop Architecture
There is a lot of material online about how Hadoop works, but I want to point out one of the principles of Hadoop and how it relates to having a relational database store its files in it. The Simple Coherency Model is a “write-once-read-many access model for files” that HDFS uses to simplify the data model in Hadoop. Once you “put” a file into HDFS, you can’t edit it to change the file. You can remove the file, but once you write it, that is it.

Database Features
There are many “small” features in Greenplum Database that either aren’t supported yet in HAWQ or never will be because HDFS won’t allow them. These features need to be taken into consideration when choosing which solution to deploy. For example, here is a short list of features found in Greenplum Database but not in HAWQ: LOCK TABLE, CREATE INDEX, CREATE TYPE, CREATE RULE, CREATE AGGREGATE, and CREATE LANGUAGE.
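
As a small illustration (the table and index names are hypothetical), this statement is accepted by Greenplum Database but rejected by HAWQ:

-- Works in Greenplum Database; HAWQ raises an error
-- because indexes aren't supported there
CREATE INDEX sales_id_idx ON sales (id);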

Two more commonly used features of Greenplum that are not in HAWQ are the DELETE and UPDATE commands. When a database executes a DELETE or UPDATE, it is editing the file: it needs to mark the old row as removed, which requires changing the file. This isn’t allowed in HDFS, so UPDATE and DELETE commands are not allowed in HAWQ.
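
A quick sketch with a hypothetical table (the exact HAWQ error text varies by release):

-- Both statements work in Greenplum Database but fail in HAWQ
-- because they would require changing files already written to HDFS
UPDATE sales SET amount = 0 WHERE id = 1;
DELETE FROM sales WHERE id = 1;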

HAWQ Features
There are also some compelling features of HAWQ that don’t exist in Greenplum Database. PXF is the Pivotal Extension Framework, which provides enhanced capabilities when working with data in HDFS from HAWQ. Simply put, you can create an External Table to HDFS that reads Text, Hive, HBase, and soon Parquet. The readers are optimized for these formats and can leverage partitioning and predicate push-down. Another feature is the ability to gather statistics on these external tables. So if you have a large investment in Hadoop already, HAWQ can leverage that investment and extend it with more capabilities.
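
Here is a minimal sketch of a PXF External Table; the host, port, path, and profile name are assumptions that vary by environment and HAWQ release:

-- External Table over delimited text files in HDFS via PXF
CREATE EXTERNAL TABLE ext_sales (id int, amount float8)
LOCATION ('pxf://namenode:50070/data/sales?Profile=HdfsTextSimple')
FORMAT 'TEXT' (DELIMITER ',');

-- Gather statistics on the External Table so the optimizer has better information
ANALYZE ext_sales;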

How does it all fit together?
A lot of companies are embracing Hadoop as an inexpensive, commodity-hardware-based, scalable solution that can handle terabytes to petabytes worth of data. HAWQ makes Hadoop better by providing a robust SQL interface built on over 10 years of Greenplum database development.

Companies are building their “data lakes” and using Pig, Hive, HBase, and now SQL to explore and transform their data. This is where HAWQ excels.
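
For example, here is a hypothetical transformation in HAWQ that lands cleansed data in a new table (reusing the ext_sales sketch from above):

-- Explore raw HDFS data with SQL and keep the transformed result in HAWQ
CREATE TABLE clean_sales AS
SELECT id, amount
FROM ext_sales
WHERE amount > 0
DISTRIBUTED BY (id);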

Once data is transformed, companies are loading the relational data for consumption into Data Marts or Enterprise Data Warehouses. This is where Greenplum database comes in: a robust MPP database providing historical Business Intelligence capabilities and forward-looking predictive analytics, all in-database with SQL.

[Figure: Architecture]

Seeing “Invisible” Data

Note: Pivotal Support requested that I mention that this GUC, along with all other hidden GUCs, is not supported. Being able to see deleted data has some benefit in extreme situations but can get you into trouble if used improperly. Do not open a support ticket on this GUC or any other hidden GUC, as Pivotal Support will not be able to help. It is hidden for a reason!

I stumbled on this neat little feature today from a hidden GUC (Grand Unified Configuration). If you aren’t familiar with the term GUC, it is basically a configuration value that can be set for a session with “SET” or in the postgresql.conf file using gpconfig.

First, let’s create a table and insert some data.

gpdb=# CREATE TABLE foo (id int NOT NULL, bar text NOT NULL) DISTRIBUTED BY (id);
CREATE TABLE
gpdb=# INSERT INTO foo SELECT i, 'test:' || i FROM generate_series(1, 10) AS i;
INSERT 0 10

Now select the data.

gpdb=# SELECT * FROM foo ORDER BY id;
 id |   bar   
----+---------
  1 | test:1
  2 | test:2
  3 | test:3
  4 | test:4
  5 | test:5
  6 | test:6
  7 | test:7
  8 | test:8
  9 | test:9
 10 | test:10
(10 rows)

Now do something like delete all of the data.

gpdb=# DELETE FROM foo;
DELETE 10

Oh no! I need that data still! How can I get it back?

gpdb=# SET gp_select_invisible = TRUE;
SET
gpdb=# SELECT * FROM foo ORDER BY id;
 id |   bar   
----+---------
  1 | test:1
  2 | test:2
  3 | test:3
  4 | test:4
  5 | test:5
  6 | test:6
  7 | test:7
  8 | test:8
  9 | test:9
 10 | test:10
(10 rows)

As you can see, the old data is still there but marked as deleted. You can still see it by turning this hidden GUC on.

Now here it is turned off to see the data normally.

gpdb=# SET gp_select_invisible = FALSE;
SET
gpdb=# SELECT * FROM foo ORDER BY id;
 id | bar 
----+-----
(0 rows)

You can use this to see non-committed data being inserted into a heap table but not an append-only table. BTW, this works in HAWQ too! Once you vacuum or truncate a table, the invisible data gets removed, so you won’t be able to see it anymore.
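
Continuing the example, once you VACUUM the table, even the invisible data is gone:

gpdb=# VACUUM foo;
VACUUM
gpdb=# SET gp_select_invisible = TRUE;
SET
gpdb=# SELECT * FROM foo ORDER BY id;
 id | bar 
----+-----
(0 rows)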

Outsourcer 4.0.2 Released

Outsourcer 4.0.2 is now available for download.

New to 4.0.2

  • Corrected Open Source license file. It is now using a BSD license and the NanoHTTPD license (web server).
  • Corrected install.sql file that incorrectly had $BODY$ for one function in the declaration.
  • Corrected cookie conflict with Command Center.
  • Reduced the number of database calls when using the Web interface.
  • Removed updates from the os.sessions table for the Web interface.
Links

  • 4.0.2 Zip
  • 4.0.2 Administrator’s Guide
  • 4.0.2 Source Code
Greenplum Database 4.3

Pivotal recently released this new version without any major announcements, but customers should definitely check it out! There are three main enhancements that I will cover in this blog post.

1. Append-Optimized Tables
2. WAL Standby Replication
3. Orca Optimizer

Append-Optimized Tables
Greenplum database has had “Append-Only” tables for several versions. They only allow inserting new data, which saves about 20 bytes per row because the table doesn’t need the overhead of per-row visibility. Append-Only tables also allow you to use compression or change the orientation to column, so they are used for larger tables in the database.

With Greenplum Database 4.3, Append-Only tables are now Append-Optimized. You still get the benefits of an Append-Only table but now you will be able to UPDATE and DELETE from these tables as well.

Here is an example with a compressed and column oriented table that is Append-Optimized:

gpdb=# CREATE TABLE bar (a int, b text)
gpdb-# WITH (appendonly=true, orientation=column, compresstype=quicklz)
gpdb-# DISTRIBUTED BY (a);
CREATE TABLE
-- Notice how it is still using "appendonly" in the syntax

INSERT some data and SELECT from this new table.

gpdb=# INSERT INTO bar VALUES (1, 'jon');
INSERT 0 1
gpdb=# INSERT INTO bar VALUES (2, 'roberts');
INSERT 0 1
gpdb=# SELECT * FROM bar;
 a |    b    
---+---------
 2 | roberts
 1 | jon
(2 rows)

UPDATE example:

gpdb=# UPDATE bar SET b = UPPER(b);
UPDATE 2
gpdb=# SELECT * FROM bar;
 a |    b    
---+---------
 1 | JON
 2 | ROBERTS
(2 rows)

DELETE example:

gpdb=# DELETE FROM bar WHERE a = 1;
DELETE 1
gpdb=# SELECT * FROM bar;
 a |    b    
---+---------
 2 | ROBERTS
(1 row)

This looks just like any other table, but the difference is in the row versioning. Instead of using MVCC (which was borrowed from PostgreSQL), Append-Optimized tables have a Visibility Map: a bitmap tracking which rows are no longer visible due to a DELETE or UPDATE statement on the table.

Because it isn’t using MVCC anymore, serializable transactions with UPDATE and DELETE on an Append-Optimized table are not allowed. A serializable transaction is a strict transaction isolation level in which transactions execute as if they ran one after another rather than concurrently. Serializable is almost never used in a big data platform, so this isn’t a big deal.
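
As a sketch (the exact error message varies):

BEGIN ISOLATION LEVEL SERIALIZABLE;
-- Raises an error: serializable UPDATE/DELETE on an Append-Optimized
-- table is not allowed in 4.3
DELETE FROM bar WHERE a = 2;
ROLLBACK;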

Another change is how VACUUM works. Instead of marking stale rows as reusable space (as with MVCC), VACUUM compacts the physical files and resets the visibility map.
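
So after heavy UPDATE and DELETE activity on an Append-Optimized table, a VACUUM reclaims the space:

-- Compacts the table's physical files and resets its visibility map
VACUUM bar;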

When should you use a Heap table (the default) versus an Append-Optimized table? If you have any OLTP work in Greenplum, Heap is better. If you have frequent UPDATE or DELETE statements, Heap is also likely better because VACUUM should run faster. Query performance may also be a bit better for smaller tables that are Heap. For larger tables that have no or infrequent UPDATE or DELETE statements, use Append-Optimized. And if you want to compress the table or make it column oriented, you must make it Append-Optimized.
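
To summarize that guidance with hypothetical table names:

-- Smaller table with frequent UPDATE/DELETE activity: Heap (the default)
CREATE TABLE dim_customer (id int, name text) DISTRIBUTED BY (id);

-- Large, insert-mostly table: Append-Optimized, column oriented, compressed
CREATE TABLE fact_sales (id int, amount float8)
WITH (appendonly=true, orientation=column, compresstype=quicklz)
DISTRIBUTED BY (id);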

WAL Standby Replication
Greenplum database has a Master server where users connect and execute queries. The Master server manages sessions, security, metadata, and query plans. Queries are executed on the Segment servers in the Greenplum cluster. To provide high availability, the Master server has a Standby-Master server. Prior to 4.3, Greenplum used block level replication to keep the data on the Master in sync with the Standby-Master.

With 4.3, the WAL (Write-Ahead Log) is now streamed to the Standby-Master, which is in standby mode. This provides faster promotion of the Standby-Master in the event the Master fails, and it also improves performance on the Master server.

Orca
Engineering has refactored the query optimizer to make it easier to maintain and enhance. They also took this opportunity to handle certain edge cases where the old optimizer picked a poor plan. Orca does a better job in certain scenarios with partitioned tables, subqueries, and common table expressions.

Orca should not be used in Production in release 4.3. To use it, you need to be part of the Early Access Program (EAP). It is a very simple command to enable, but Pivotal wants customers to provide feedback before making Orca the default optimizer. Contact your Pivotal Field Engineer to get into this program, and if you don’t know who that is, contact me and I’ll help you.
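
For reference, here is a sketch assuming the optimizer GUC name that later releases use; in 4.3 it should only be touched under the EAP:

-- Use Orca for the current session (EAP only in 4.3)
SET optimizer = on;
-- Switch back to the legacy query optimizer
SET optimizer = off;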

Summary
Greenplum database 4.3 is a great release with many new features. Customers can download it now from emc.subscribenet.com and documentation from support.emc.com.