Saturday, May 25, 2013

Installing and comparing MySQL/MariaDB, MongoDB, Vertica, Hive and Impala (Part 1)



A common thing a data analyst does in their day-to-day job is to run aggregations of data, generally by summing and averaging columns using different filters. When tables grow to hundreds of millions or billions of rows, these operations become extremely expensive and the choice of a database engine is crucial. Indeed, the more queries an analyst can run during the day, the better they can understand the data.

In this post, we’re going to install 5 popular databases on Linux Ubuntu (12.04):
  • MySQL / MariaDB 10.0: Row based database
  • MongoDB 2.4: NoSQL database
  • Vertica Community Edition 6: Columnar database (similar to Infobright, InfiniDB, …)
  • Hive 0.10: Datawarehouse built on top of HDFS using Map/Reduce
  • Impala 1.0:  Database implemented on top of HDFS (compatible with Hive) based on Dremel that can use different data formats (raw CSV format, Parquet columnar format, …)
Then we’ll provide some scripts to populate them with test data, run some simple aggregation queries and measure the response times. The tests will be run on a single box, without any tuning, using a relatively small dataset (160 million rows), but we’re planning to run more thorough tests in the cloud later with much bigger datasets (billions of rows). This is just to give a general idea of the performance of each database.

Installation

Installing MySQL/MariaDB

Go to https://downloads.mariadb.org/mariadb/repositories/ and follow instructions:
$ sudo apt-get install python-software-properties
$ sudo apt-key adv --recv-keys --keyserver keyserver.ubuntu.com 0xcbcb082a1bb943db
$ sudo add-apt-repository 'deb http://ftp.osuosl.org/pub/mariadb/repo/10.0/ubuntu precise main'
$ sudo apt-get update
$ sudo apt-get install mariadb-server
To start it:
$ sudo service mysql start
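To check that the server is up, you can run a simple query from the command line (it will prompt for the root password you chose during installation):
$ mysql -u root -p -e "SELECT VERSION();"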

Installing MongoDB

Download the latest version from http://www.mongodb.org/downloads. Uncompress the archive:
$ cd /usr/local
$ sudo tar xvfz /tmp/mongodb-linux-x86_64-2.4.3.tgz
$ sudo ln -s mongodb-linux-x86_64-2.4.3 mongo
Edit the file /etc/profile and add the mongo binaries to the PATH:
export PATH=/usr/local/mongo/bin:$PATH
To start it:
$ sudo /usr/local/mongo/bin/mongod --fork --logpath=/var/log/mongodb.log
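To verify that mongod is responding (assuming the default port 27017), you can ping it from the mongo shell:
$ /usr/local/mongo/bin/mongo --eval 'db.runCommand({ ping: 1 })'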

Installing Vertica

Go to https://my.vertica.com/download-community-edition/, create a login / password and follow the instructions.
Under Download HP Vertica Analytic Database Server Community Edition, click on Debian Linux 5/6.
Under Download HP Vertica Client Packages for the Community Edition, click on Linux 64-bit.
Once this is downloaded, install the package as follows:
$ sudo apt-get install ssh sysstat pstack
$ sudo dpkg -i vertica-ce_6.0.1-0_amd64.deb
$ sudo tar xvfz vertica-client-6.0.1-0.x86_64.tar.gz -C /
This will uncompress the Vertica client in /opt/vertica.
Also, there will be some issues with the Debian version check that we can fix following the tips given in a YouTube video provided by דניאל ליבוביץ.
Basically, the file /opt/vertica/oss/python/lib/python2.7/site-packages/vertica/network/SystemProfileFactory.py expects to read a Debian version number (4, 5 or 6) in /etc/debian_version, but on Ubuntu there is no version number there. So either add a number in /etc/debian_version (e.g., 6 wheezy/sid) or edit SystemProfileFactory.py and locate the following lines:
        # Debian 4 and 5 Support
        (status, res) = ssh.execute("[ -f /etc/debian_version ]", hide=True)
        if (res[0] == '0'):
            (status, res) = ssh.execute("grep \"^4.\" /etc/debian_version",
Replace the line:
            (status, res) = ssh.execute("grep \"^4.\" /etc/debian_version",
by:
            (status, res) = ssh.execute("grep \"^.\" /etc/debian_version",
which tells the script that it will always be ok with your Debian version.
There is also another problem, described in this thread, that you need to fix: edit the file /opt/vertica/bin/validators.sh and replace the line:
dpkg -p vertica 2>&1 > /dev/null
by:
dpkg -p vertica > /dev/null 2>&1
Finally run the install script:
$ sudo /opt/vertica/sbin/install_vertica
$ su - dbadmin
$ /opt/vertica/bin/adminTools
In adminTools, go to the Configuration Menu and select Create database (database name: test).
For our example, we’ll assume you use the password “password”.
Next time, when you need to restart the database, you can use the same interface to do so.
Edit the file /etc/profile and add the following line:
export PATH=/opt/vertica/bin:$PATH

Installing Hadoop

Install Java 1.6 and 1.7 from the Oracle Java Site.
Unpack them as follows:
$ sudo mkdir /opt/java
$ cd /opt/java
$ sudo sh ~/Downloads/jdk-6u45-linux-x64.bin
$ sudo tar xvfz ~/Downloads/jdk-7u21-linux-x64.tar.gz
$ sudo ln -s jdk1.6.0_45 default
Edit the file /etc/profile and append the following lines at the end:
export JAVA_HOME=/opt/java/default
export PATH=$JAVA_HOME/bin:$PATH

$ cd /etc/apt/sources.list.d/
$ sudo wget http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh/cloudera.list -O archive-cloudera.list
$ sudo sh -c "wget http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh/archive.key -O - | apt-key add -"
$ sudo apt-get update
$ sudo apt-get install hadoop hadoop-conf-pseudo hadoop-hdfs hadoop-hdfs-datanode hadoop-hdfs-namenode hadoop-mapreduce hadoop-yarn hadoop-yarn-nodemanager hadoop-yarn-resourcemanager
Edit the file /etc/hadoop/conf/hdfs-site.xml and add the following lines:
<property>
  <name>dfs.permissions</name>
  <value>false</value>
  <description>Whether we consider permissions or not</description>
</property>
<property>
  <name>dfs.client.file-block-storage-locations.timeout</name>
  <value>3000</value>
</property>
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
We disable permissions for the sake of simplicity. The last 2 properties will be used by Impala.
You’ll also need to authorize SSH access to the box without entering your password:
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa 
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
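You can then verify that passwordless SSH works (BatchMode makes ssh fail instead of prompting for a password):
$ ssh -o StrictHostKeyChecking=no -o BatchMode=yes localhost echo "passwordless ssh OK"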
To start Hadoop:
$ sudo service hadoop-hdfs-namenode start
$ sudo service hadoop-hdfs-datanode start
$ sudo service hadoop-yarn-resourcemanager start
$ sudo service hadoop-yarn-nodemanager start
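A quick way to check that HDFS is up is to list the root of the filesystem:
$ hadoop fs -ls /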

Installing Hive

Untar it:
$ tar xvfz hive-0.10.0-bin.tar.gz -C /usr/local
$ sudo ln -s /usr/local/hive-0.10.0-bin /usr/local/hive
Let’s configure Hive so that metadata are stored in MySQL.
Create the file $HIVE_HOME/conf/hive-site.xml with the following content:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hive.metastore.local</name>
    <value>true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <!-- the value was not shown in the original post; a typical value pointing to the local MariaDB hive database is: -->
    <value>jdbc:mysql://localhost/hive?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive</value>
  </property>
  <property>
    <name>datanucleus.transactionIsolation</name>
    <value>repeatable-read</value>
  </property>
  <property>
    <name>datanucleus.valuegeneration.transactionIsolation</name>
    <value>repeatable-read</value>
  </property>
</configuration>
Download the MySQL JDBC driver from http://dev.mysql.com/downloads/connector/j (using the MariaDB JDBC driver won’t work well with Hive).
Untar it and copy the jar file to $HIVE_HOME/lib:
$ tar xvfz mysql-connector-java-5.1.25.tar.gz
$ cp mysql-connector-java-5.1.25/mysql-connector-java-5.1.25-bin.jar $HIVE_HOME/lib
In MariaDB, create the hive database and user:
$ mysql -u root -p
MariaDB [(none)]> CREATE DATABASE hive;
MariaDB [(none)]> GRANT ALL PRIVILEGES ON hive.* TO hive@localhost IDENTIFIED BY 'hive';
MariaDB [(none)]> GRANT ALL PRIVILEGES ON hive.* TO hive@'%' IDENTIFIED BY 'hive';
Edit the file /etc/profile and add the following lines:
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin
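At this point a quick sanity check is to launch Hive and list the tables; the first run will also initialize the metastore schema in the MySQL hive database:
$ . /etc/profile
$ hive -e "SHOW TABLES;"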

Installing Impala

We tried using the repo provided by Cloudera but got some issues when running the server.
Here is what we did:
$ cd /etc/apt
$ sudo wget http://archive.cloudera.com/impala/ubuntu/precise/amd64/impala/cloudera.list -O cloudera-impala.list
$ sudo apt-get update
$ sudo apt-get install impala impala-server impala-shell
But when we started the service, the process crashed with a Segmentation Fault (with Java 1.7) and with another error with Java 1.6 (/usr/lib/impala/sbin/impalad: error while loading shared libraries: libjvm.so: wrong ELF class: ELFCLASS32).
So we decided to build Impala from the sources on GitHub, using some instructions taken from https://users.soe.ucsc.edu/~ivo//blog/2013/04/03/building-impala-on-ubuntu-12.04/ and from the Impala GitHub page.
First install some packages:
$ sudo apt-get install \
     build-essential automake libtool flex bison \
     git subversion \
     unzip \
     libboost-test-dev libboost-program-options-dev libboost-filesystem-dev libboost-system-dev \
     libboost-regex-dev libboost-thread-dev \
     protobuf-compiler \
     libsasl2-dev \
     libbz2-dev \
     libevent1-dev \
     pkg-config \
     doxygen
Install LLVM:
$ wget http://llvm.org/releases/3.2/llvm-3.2.src.tar.gz
$ tar xvzf llvm-3.2.src.tar.gz
$ cd llvm-3.2.src/tools
$ svn co http://llvm.org/svn/llvm-project/cfe/tags/RELEASE_32/final/ clang
$ cd ../projects
$ svn co http://llvm.org/svn/llvm-project/compiler-rt/tags/RELEASE_32/final/ compiler-rt
$ cd ..
$ ./configure --with-pic --prefix=$HOME/.opt/
$ make -j4 REQUIRES_RTTI=1
$ make install
For some reason, we only managed to make Impala work by compiling it with Java 1.7 and then running it with Java 1.6.
Install Maven:
$ wget http://www.fightrice.com/mirrors/apache/maven/maven-3/3.0.4/binaries/apache-maven-3.0.4-bin.tar.gz
$ sudo tar xvf apache-maven-3.0.4-bin.tar.gz && sudo mv apache-maven-3.0.4 /usr/local
$ sudo ln -s /usr/local/apache-maven-3.0.4 /usr/local/maven
Edit the file /etc/profile and add the following lines:
export JAVA_HOME=/opt/java/default
export MVN_HOME=/usr/local/maven
export PATH=$JAVA_HOME/bin:$MVN_HOME/bin:$PATH
Build Impala:
$ . /etc/profile
$ export JAVA_HOME=/opt/java/jdk1.7.0_21
$ git clone http://github.com/cloudera/impala
$ cd impala
$ . bin/impala-config.sh
$ cd thirdparty
$ ./download_thirdparty.sh
$ cd ..
$ ./build_public.sh -build_thirdparty
Edit the file bin/set-classpath.sh and configure the CLASSPATH as follows:
CLASSPATH=\
$IMPALA_HOME/fe/src/test/resources:\
$IMPALA_HOME/fe/target/classes:\
$IMPALA_HOME/fe/target/dependency:\
$IMPALA_HOME/fe/target/test-classes:\
${HIVE_HOME}/lib/datanucleus-core-2.0.3.jar:\
${HIVE_HOME}/lib/datanucleus-enhancer-2.0.3.jar:\
${HIVE_HOME}/lib/datanucleus-rdbms-2.0.3.jar:\
${HIVE_HOME}/lib/datanucleus-connectionpool-2.0.3.jar:\
${HIVE_HOME}/conf:\
${HIVE_HOME}/lib/mysql-connector-java-5.1.25-bin.jar:\
/etc/hadoop/conf

Running the tests

Data Generator

We wrote a simple data generator that creates data inspired by advertising networks. It represents the number of impressions, the number of clicks and the revenue, broken out by advertiser, publisher and hour.
In terms of data distribution, we have one year of hourly data with 20 advertisers and 100 publishers, resulting in 160 million rows.
First get the source code:
$ git clone http://github.com/chimpler/blog-mysql-vertica-mongodb-impala
$ cd blog-mysql-vertica-mongodb-impala/datagen
$ make
$ ./gendata > /tmp/datatest.csv
Note that it will take about 15 minutes to generate the 160 million rows.
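You can peek at the generated file to check that it looks sane (we don’t reproduce the exact column layout here):
$ wc -l /tmp/datatest.csv
$ head -3 /tmp/datatest.csv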

Importing data in MySQL/MariaDB

$ sudo mysql -u root -p
MariaDB [(none)]> CREATE DATABASE test;
MariaDB [(none)]> GRANT ALL PRIVILEGES ON test.* TO test@localhost IDENTIFIED BY 'test';
MariaDB [(none)]> GRANT ALL PRIVILEGES ON test.* TO test@'%' IDENTIFIED BY 'test';
Then use our import script:
$ ./import_mysql.sh test test test
It will basically create a new table analytics and import the CSV data.
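If you prefer to do it by hand, the script boils down to something like the following. The column names come from the test queries shown later; the exact types are defined in the repository’s script, so treat this schema as an assumption:
$ mysql -u test -ptest test <<'EOF'
-- schema is a sketch based on the columns used by the test queries
CREATE TABLE IF NOT EXISTS analytics (
  ymdh          DATETIME,
  advertiser_id INT,
  publisher_id  INT,
  imps          BIGINT,
  clicks        BIGINT,
  revenue       DOUBLE
);
LOAD DATA LOCAL INFILE '/tmp/datatest.csv'
  INTO TABLE analytics
  FIELDS TERMINATED BY ',';
EOF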

Importing data in Vertica

To use the Vertica shell:
/opt/vertica/bin/vsql -d test -U dbadmin -w password
Useful commands:
  • \d: see all the tables
  • \d analytics: describe table analytics
  • \timing: toggle timings on/off (you can put it in your $HOME/.vsqlrc to enable it by default)
  • \q: quit
You can run the script import_vertica.sh that will take care of creating the table analytics and importing the CSV data:
$ ./import_vertica.sh dbadmin password test
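Under the hood the import essentially relies on Vertica’s COPY command; a rough hand-run equivalent (assuming the analytics table already exists) would look like:
$ vsql -d test -U dbadmin -w password -c \
    "COPY analytics FROM '/tmp/datatest.csv' DELIMITER ',' DIRECT;"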

Importing data in Hive

Some useful commands:
  • show tables;: show tables
  • desc <table_name>;: description of the table
  • quit;: quit
You can run Hive as follows to get more logs:
$ hive -hiveconf hive.root.logger=INFO,console
You can run the script import_hive.sh to import the data into Hive:
./import_hive.sh
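Roughly, the script amounts to creating a delimited table and loading the CSV into it, along these lines (the column types are assumptions; see the script for the exact DDL):
$ hive -e "
  CREATE TABLE IF NOT EXISTS analytics (
    ymdh STRING,
    advertiser_id INT,
    publisher_id INT,
    imps BIGINT,
    clicks BIGINT,
    revenue DOUBLE
  ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
  LOAD DATA LOCAL INPATH '/tmp/datatest.csv' INTO TABLE analytics;
"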

Importing data in Impala

Some useful commands:
  • show tables;: show tables
  • desc table_name;: description of the table
  • refresh: refresh schema from metastore
  • quit;: quit
The table analytics should be visible since we’re getting the metadata from the Hive metastore:
$ impala-shell
[localhost:21000] > refresh;
[localhost:21000] > show tables;
+-------------------+
| name              |
+-------------------+
| analytics         |
+-------------------+
We can query the table using Impala. However we can use Parquet to store the data in a columnar format to have better performance. Let’s create a Parquet table and import the data there:
$ ./import_impala_parquet.sh
That will create a new table analytics_parquet with the same data but using the Parquet format.
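If you want to do it manually, the idea (in Impala 1.0 syntax) is to create a Parquet-backed table and repopulate it from the CSV-backed one, roughly as follows:
$ impala-shell
[localhost:21000] > CREATE TABLE analytics_parquet LIKE analytics STORED AS PARQUETFILE;
[localhost:21000] > INSERT OVERWRITE TABLE analytics_parquet SELECT * FROM analytics;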

Some measures

For each database, we report the time to run the query the first time and the time to run it a second time (cached).
Data Set:
  • 160 million rows
  • 12 months, 28 days per month, 24 hours
  • 20 advertisers
  • 100 publishers
  • imps, clicks and revenue metrics
We ran the following tests:
  • Test 1: SELECT sum(imps) FROM analytics
  • Test 2: SELECT advertiser_id, AVG(clicks) FROM analytics WHERE ymdh BETWEEN '2012-06-01 00:00:00' AND '2012-07-01 00:00:00' GROUP BY advertiser_id
  • Test 3: SELECT AVG(1000 * revenue / imps) AS cpm FROM analytics WHERE ymdh > '2012-12-01 00:00:00' GROUP BY publisher_id ORDER BY cpm DESC LIMIT 10
We used the script test_all.sh, which runs the tests on all the different databases, on an AMD Phenom(tm) II X6 1045T processor (2.7 GHz) with 8GB of RAM.
We obtained the following response times:
Database         | Import Time                      | Indexing Time | Disk Space | Test 1        | Test 2      | Test 3
MariaDB          | 83m                              | 224m          | 21GB       | 115s / 0.1s   | 51s / 0.1s  | 47s / 0.1s
Vertica          | 15m                              | -             | 93GB       | 15.3s / 1.3s  | 2.7s / 0.5s | 5.16s / 3.25s
MongoDB          | 80m                              | 175m          | 59GB       | 305s / 305s   | 59s / 47s   | 59s / 52s
Hive             | 250s                             | -             | 59GB       | 252s / 225s   | 381s / 382s | 404s / 407s
Impala (CSV)     | -                                | -             | 59GB       | 131s / 119s   | 119s / 110s | 106s / 109s
Impala (Parquet) | 219s (from data already in HDFS) | -             | 2.8GB      | 12s / 6s      | 62s / 62s   | 51s / 53s
For each test, we measured the response time of the first run and of the second run to see whether any caching is at work. In our tests we can see that MySQL / MariaDB and Vertica use query caching.
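In the MySQL / MariaDB case you can confirm that the query cache is responsible by bypassing it with the SQL_NO_CACHE hint, for example:
$ mysql -u test -ptest test -e "SELECT SQL_NO_CACHE sum(imps) FROM analytics;"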
Note that Hive performance can be improved by compressing data in order to reduce I/O time. Check http://www.adaltas.com/blog/2012/03/13/hdfs-hive-storage-format-compression/ and http://www.saurabhnanda.com/2009/07/how-to-make-your-hive-cluster-blazingly.html for more information.

Conclusion

When looking strictly at the numbers, Vertica is a clear winner. MySQL, because of its row-based storage design, is not well suited for aggregations on millions of rows but still performs reasonably well. MongoDB, despite its new Aggregation Framework, is not suited for this kind of aggregation either. As expected, Hive performs poorly, but this can be explained by the Map/Reduce overhead. Finally, we can see that Impala is promising in the Hadoop world: using the raw CSV file format, it is much faster than Hive, and it performs quite well when using the columnar Parquet format. However, when filtering is involved, it does not perform very well.
In this post, we didn’t consider any partitioning of the data and we ran our queries on a single box, so we didn’t take advantage of parallel execution. We also didn’t compare how the databases degrade under load as the number of concurrent connections increases. Moreover, Impala went GA only a week ago, so we can expect more performance improvements in the months to come. Other Dremel-inspired implementations such as Hortonworks Stinger and EMC HAWQ boast very fast response times, so it will be interesting to compare them with Impala when they are publicly released.
