Monthly Archives: December 2015

Greenplum Single Node Installation

Step 1
Download a CentOS 6 VM from http://virtual-machine.org/.

Step 2
Download the latest Greenplum binaries for RedHat Enterprise Linux 6 from http://network.pivotal.io
GPDB Download

Step 3
Start the Virtual Machine with VMWare Fusion or something similar.
Memory: 8GB
Cores: 4
Disk: 50GB
You can use less memory and cores but the more you provide the VM, the better it will perform. You might have to expand the VM disk space when using the VirtualMachine.org VM.

Step 4
Configure the operating system.

hostname gpdbsne

Add the hostname to /etc/sysconfig/network too.

Turn off firewalls.

chkconfig iptables off
service iptables stop
echo 0 >/selinux/enforce
vi /etc/selinux/config
SELINUX=disabled
setenforce 0

Edit the /etc/hosts file.

echo "127.0.0.1 gpdbsne gpdbsne.localdomain" >> /etc/hosts

I also like to get the ip address for this host and add it to my local /etc/hosts file.

ifconfig

Install unzip, ed, ntp and ssh.

yum install ntp
yum install unzip
yum install openssh-clients
yum install ed
chkconfig ntpd on
ntpdate pool.ntp.org
/etc/init.d/ntpd start

Add the following to the end of your /etc/sysctl.conf file.

kernel.shmmax = 500000000
kernel.shmmni = 4096
kernel.shmall = 4000000000
kernel.sem = 250 512000 100 2048
kernel.sysrq = 1
kernel.core_uses_pid = 1
kernel.msgmnb = 65536
kernel.msgmax = 65536
kernel.msgmni = 2048
net.ipv4.tcp_syncookies = 1
net.ipv4.ip_forward = 0
net.ipv4.conf.default.accept_source_route = 0
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_max_syn_backlog = 4096
net.ipv4.conf.all.arp_filter = 1
net.ipv4.ip_local_port_range = 1025 65535
net.core.netdev_max_backlog = 10000
net.core.rmem_max = 2097152
net.core.wmem_max = 2097152
vm.overcommit_memory = 0

Add this to your /etc/security/limits.conf file.

* soft nofile 65536
* hard nofile 65536
* soft nproc 131072
* hard nproc 131072

Remove all lines from /etc/security/limits.d/90-nproc.conf

echo "" > /etc/security/limits.d/90-nproc.conf

There are some other configuration changes you make on a real cluster that involves XFS filesystem but for a SNE, this can be skipped. This is just intended for Development and Testing purposes.

Restart the VM so these changes take affect.

shutdown -r now

Step 5
Copy the installer to the VM.

scp greenplum-db-4.3.6.2-build-1-RHEL5-x86_64.zip root@gpdbsne:/root/

Step 6
ssh to the VM and run the installer.

ssh root@gpdbsne
unzip greenplum-db-4.3.6.2-build-1-RHEL5-x86_64.zip
/bin/bash greenplum-db-4.3.6.2-build-1-RHEL5-x86_64.bin

--Accept the license agreement
--Accept default installation directory

Step 7
For a multi-node cluster, the next step is to use gpseginstall but this isn’t needed with a single node installation. Instead, you have to manually create the gpadmin account and get the cluster ready for the next step.

useradd gpadmin
passwd gpadmin
chown -R gpadmin:gpadmin /usr/local/greenplum-db-4.3.6.2/
mkdir -p /data/master
mkdir /data/primary
chown -R gpadmin:gpadmin /data
su - gpadmin
echo "source /usr/local/greenplum-db/greenplum_path.sh" >> .bashrc
echo "export MASTER_DATA_DIRECTORY=/data/master/gpseg-1" >> .bashrc
source .bashrc
echo "gpdbsne" > hostfile
gpssh-exkeys -f hostfile

Step 8
Create an initialize file called gp_init_config so you can initialize the database.

ARRAY_NAME="Greenplum"
MACHINE_LIST_FILE=./hostfile
SEG_PREFIX=gpseg
PORT_BASE=50000
declare -a DATA_DIRECTORY=(/data/primary /data/primary )
MASTER_HOSTNAME=gpdbsne
MASTER_DIRECTORY=/data/master
MASTER_PORT=5432
TRUSTED_SHELL=ssh
CHECK_POINT_SEGMENTS=8
ENCODING=UNICODE

Step 9
Initialize the database.

gpinitsystem -c ~/gp_init_config
--select Y to continue

Now create the default database and configure the database to allow external connections.

psql -c "create database gpadmin" template1
psql -c "alter user gpadmin password 'changeme'"
echo "host all all 0.0.0.0/0 md5" >> /data/master/gpseg-1/pg_hba.conf
gpstop -u

Complete!
Now you can connect to Greenplum with the gpadmin account and the password is changeme. The default port is 5432. A great client tool is pgAdmin III v1.14.3.
http://www.postgresql.org/ftp/pgadmin3/release/v1.14.3/

Dear Cloudera, Please Stop!

So I was alerted to this blog post by a colleague and I was floored on what Cloudera is doing.

Basically, they are creating a new feature for objects that can store multiple levels in a single table. In a traditional relational structure, you may have an Orders table with another table for Order_Details. You would simply JOIN the two tables together when needed.

CREATE SCHEMA example;

CREATE TABLE example.Orders
(order_id int NOT NULL,
 order_date date NOT NULL,
 customer_id int NOT NULL);

CREATE TABLE example.Order_Details
(order_detail_id int NOT NULL,
 order_id int NOT NULL,
 product_id int NOT NULL,
 quantity int NOT NULL,
 price numeric NOT NULL);

And when you want to query both tables:

SELECT o.order_date, sum(od.price)
FROM example.orders o 
JOIN example.order_details od ON o.order_id = od.order_id
GROUP BY o.order_date;

What Cloudera is doing is an Object approach which combines both tables into a single table using a feature called a STRUCT. This is similar to a composite type in Greenplum, PostgreSQL, and HAWQ.

--Impala
CREATE TABLE Orders
(order_id int NOT NULL,
 order_date date NOT NULL,
 customer_id int NOT NULL,
 Order_Details ARRAY<STRUCT<
    order_detail_id: int,
    product_id: int,
    quantity: int,
    price: numeric>>
);

I get this approach when working with a OLTP system and developers want to match their Classes to the database structure but I really don’t see the benefit of this in a big data platform. It just makes it difficult for Database Users to understand and use the data.

Schema.Table
Every database I’ve ever worked with uses Schema.Table to organize table structures but not Impala! Cloudera has decided to break away from the decades old standard and use the dot notation for their nested fields stored within a STRUCT. If you adopt this silliness, you are adopting a standard that no one else uses. Why would Cloudera want that? Vendor lock-in maybe????

--Impala
select * from orders.order_details;

Orders isn’t the schema in the above example. This is a table. Cloudera just broke SQL!

Cartesian Product? Nope, Just Vendor Specific Syntax
Notice how this query appears to reference two different tables without specifying a join.

SELECT o.order_date, sum(od.price)
FROM orders o, o.order_details od 
GROUP BY o.order_date;

They even allow you to do an OUTER JOIN to the STRUCT ARRAY without defining the columns to join on.

SELECT o.order_date, sum(od.price)
FROM orders o
LEFT OUTER JOIN o.order_details od 
GROUP BY o.order_date;

Reasons Why Cloudera is Doing This
Here are my guesses as to why they are doing this.
1. They hate SQL or just don’t understand it.
2. Vendor Lock-In. They need ways to make it difficult for customers to switch to a competitor’s product.
3. Impala is bad at joins. Organizing tables like this make it easier for Impala’s query optimizer to create a robust query plan.

Please Cloudera, just stop!