Saturday, May 25, 2013

Installing and comparing MySQL/MariaDB, MongoDB, Vertica, Hive and Impala (Part 1)



A common thing a data analyst does in their day-to-day job is to run aggregations of data, generally by summing and averaging columns using different filters. When tables grow to hundreds of millions or billions of rows, these operations become extremely expensive and the choice of a database engine is crucial. Indeed, the more queries an analyst can run during the day, the better they can understand the data.

In this post, we’re going to install 5 popular databases on Linux Ubuntu (12.04):
  • MySQL / MariaDB 10.0: Row based database
  • MongoDB 2.4: NoSQL database
  • Vertica Community Edition 6: Columnar database (similar to Infobright, InfiniDB, …)
  • Hive 0.10: Datawarehouse built on top of HDFS using Map/Reduce
  • Impala 1.0:  Database implemented on top of HDFS (compatible with Hive) based on Dremel that can use different data formats (raw CSV format, Parquet columnar format, …)
Then we’ll provide some scripts to populate them with test data, run some simple aggregation queries and measure the response times. The tests will be run on a single box, without any tuning, using a relatively small dataset (160 million rows), but we’re planning to run more thorough tests in the cloud later with much bigger datasets (billions of rows). This is just to give a general idea of the performance of each database.

Installation

Installing MySQL/MariaDB

Go to https://downloads.mariadb.org/mariadb/repositories/ and follow instructions:
$ sudo apt-get install python-software-properties
$ sudo apt-key adv --recv-keys --keyserver keyserver.ubuntu.com 0xcbcb082a1bb943db
$ sudo add-apt-repository 'deb http://ftp.osuosl.org/pub/mariadb/repo/10.0/ubuntu precise main'
$ sudo apt-get update
$ sudo apt-get install mariadb-server
To start it:
$ sudo service mysql start
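To check that the server is up, you can run a simple query from the command line (it will prompt for the root password you chose during installation):
$ mysql -u root -p -e "SELECT VERSION();"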

Installing MongoDB

Download the latest version from http://www.mongodb.org/downloads. Uncompress the archive:
$ cd /usr/local
$ sudo tar xvfz /tmp/mongodb-linux-x86_64-2.4.3.tgz
$ sudo ln -s mongodb-linux-x86_64-2.4.3 mongo
Edit the file /etc/profile and add the mongo binaries to the PATH:
export PATH=/usr/local/mongo/bin:$PATH
To start it:
$ sudo /usr/local/mongo/bin/mongod --fork --logpath=/var/log/mongodb.log
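To verify that mongod is responding (assuming the default port 27017), you can ping it from the mongo shell:
$ /usr/local/mongo/bin/mongo --eval 'db.runCommand({ ping: 1 })'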

Installing Vertica

Go to https://my.vertica.com/download-community-edition/, create a login / password and follow the instructions.
Under Download HP Vertica Analytic Database Server Community Edition, click on Debian Linux 5/6.
Under Download HP Vertica Client Packages for the Community Edition, click on Linux 64-bit.
Once this is downloaded, install the package as follows:
$ sudo apt-get install ssh sysstat pstack
$ sudo dpkg -i vertica-ce_6.0.1-0_amd64.deb
$ sudo tar xvfz vertica-client-6.0.1-0.x86_64.tar.gz -C /
This will uncompress the Vertica client in /opt/vertica.
Also, there will be some issues with the Debian version check that we can fix following the tips given in a YouTube video provided by דניאל ליבוביץ.
Basically, the file /opt/vertica/oss/python/lib/python2.7/site-packages/vertica/network/SystemProfileFactory.py expects to read a Debian version number (4, 5 or 6) in /etc/debian_version, but on Ubuntu there is no version number there. So either add a number in /etc/debian_version (e.g., 6 wheezy/sid) or edit SystemProfileFactory.py and locate the following lines:
        # Debian 4 and 5 Support
        (status, res) = ssh.execute("[ -f /etc/debian_version ]", hide=True)
        if (res[0] == '0'):
            (status, res) = ssh.execute("grep \"^4.\" /etc/debian_version",
Replace the line:
            (status, res) = ssh.execute("grep \"^4.\" /etc/debian_version",
by:
            (status, res) = ssh.execute("grep \"^.\" /etc/debian_version",
which tells the script that it will always be ok with your Debian version.
There is also another problem, described in this thread, that you need to fix: edit the file /opt/vertica/bin/validators.sh and replace the line:
dpkg -p vertica 2>&1 > /dev/null
by:
dpkg -p vertica > /dev/null 2>&1
Finally run the install script:
$ sudo /opt/vertica/sbin/install_vertica
$ su - dbadmin
$ /opt/vertica/bin/adminTools
In adminTools, go to the Configuration Menu and select Create database (database name: test).
For our example, we’ll assume you use the password “password”.
Next time, when you need to restart the database, you can use the same interface to do so.
Edit the file /etc/profile and add the following line:
export PATH=/opt/vertica/bin:$PATH

Installing Hadoop

Install Java 1.6 and 1.7 from the Oracle Java Site.
Unpack them as follows:
$ sudo mkdir /opt/java
$ cd /opt/java
$ sudo sh ~/Downloads/jdk-6u45-linux-x64.bin
$ sudo tar xvfz ~/Downloads/jdk-7u21-linux-x64.tar.gz
$ sudo ln -s jdk1.6.0_45 default
Edit the file /etc/profile and append the following lines at the end:
export JAVA_HOME=/opt/java/default
export PATH=$JAVA_HOME/bin:$PATH

$ cd /etc/apt/sources.list.d/
$ sudo wget http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh/cloudera.list -O archive-cloudera.list
$ sudo sh -c "wget http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh/archive.key -O - | apt-key add -"
$ sudo apt-get update
$ sudo apt-get install hadoop hadoop-conf-pseudo hadoop-hdfs hadoop-hdfs-datanode hadoop-hdfs-namenode hadoop-mapreduce hadoop-yarn hadoop-yarn-nodemanager hadoop-yarn-resourcemanager
Edit the file /etc/hadoop/conf/hdfs-site.xml and add the following lines:
<property>
  <name>dfs.permissions</name>
  <value>false</value>
  <description>Whether we consider permissions or not</description>
</property>
<property>
  <name>dfs.client.file-block-storage-locations.timeout</name>
  <value>3000</value>
</property>
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
We disable permissions for the sake of simplicity. The last 2 properties will be used by Impala.
You’ll also need to authorize SSH access to the box without entering your password:
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa 
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
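You can then verify that passwordless SSH works (BatchMode makes ssh fail instead of prompting for a password):
$ ssh -o StrictHostKeyChecking=no -o BatchMode=yes localhost echo "passwordless ssh OK"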
To start Hadoop:
$ sudo service hadoop-hdfs-namenode start
$ sudo service hadoop-hdfs-datanode start
$ sudo service hadoop-yarn-resourcemanager start
$ sudo service hadoop-yarn-nodemanager start
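A quick way to check that HDFS is up is to list the root of the filesystem:
$ hadoop fs -ls /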

Installing Hive

Untar it:
$ tar xvfz hive-0.10.0-bin.tar.gz -C /usr/local
$ sudo ln -s /usr/local/hive-0.10.0-bin /usr/local/hive
Let’s configure Hive so that metadata are stored in MySQL.
Create the file $HIVE_HOME/conf/hive-site.xml with the following content:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hive.metastore.local</name>
    <value>true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <!-- the value was not shown in the original post; a typical value pointing to the local MariaDB hive database is: -->
    <value>jdbc:mysql://localhost/hive?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive</value>
  </property>
  <property>
    <name>datanucleus.transactionIsolation</name>
    <value>repeatable-read</value>
  </property>
  <property>
    <name>datanucleus.valuegeneration.transactionIsolation</name>
    <value>repeatable-read</value>
  </property>
</configuration>
Download the MySQL JDBC driver from http://dev.mysql.com/downloads/connector/j (using the MariaDB JDBC driver won’t work well with Hive).
Untar it and copy the jar file to $HIVE_HOME/lib:
$ tar xvfz mysql-connector-java-5.1.25.tar.gz
$ cp mysql-connector-java-5.1.25/mysql-connector-java-5.1.25-bin.jar $HIVE_HOME/lib
In MariaDB, create the hive database and user:
$ mysql -u root -p
MariaDB [(none)]> CREATE DATABASE hive;
MariaDB [(none)]> GRANT ALL PRIVILEGES ON hive.* TO hive@localhost IDENTIFIED BY 'hive';
MariaDB [(none)]> GRANT ALL PRIVILEGES ON hive.* TO hive@'%' IDENTIFIED BY 'hive';
Edit the file /etc/profile and add the following lines:
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin
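At this point a quick sanity check is to launch Hive and list the tables; the first run will also initialize the metastore schema in the MySQL hive database:
$ . /etc/profile
$ hive -e "SHOW TABLES;"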

Installing Impala

We tried using the repo provided by Cloudera but got some issues when running the server.
Here is what we did:
$ cd /etc/apt
$ sudo wget http://archive.cloudera.com/impala/ubuntu/precise/amd64/impala/cloudera.list -O cloudera-impala.list
$ sudo apt-get update
$ sudo apt-get install impala impala-server impala-shell
But when we started the service, the process crashed with a Segmentation Fault (with Java 1.7) and with another error with Java 1.6 (/usr/lib/impala/sbin/impalad: error while loading shared libraries: libjvm.so: wrong ELF class: ELFCLASS32).
So we decided to build Impala from the sources on GitHub, using some instructions taken from https://users.soe.ucsc.edu/~ivo//blog/2013/04/03/building-impala-on-ubuntu-12.04/ and from the Impala GitHub page.
First install some packages:
$ sudo apt-get install \
     build-essential automake libtool flex bison \
     git subversion \
     unzip \
     libboost-test-dev libboost-program-options-dev libboost-filesystem-dev libboost-system-dev \
     libboost-regex-dev libboost-thread-dev \
     protobuf-compiler \
     libsasl2-dev \
     libbz2-dev \
     libevent1-dev \
     pkg-config \
     doxygen
Install LLVM:
$ wget http://llvm.org/releases/3.2/llvm-3.2.src.tar.gz
$ tar xvzf llvm-3.2.src.tar.gz
$ cd llvm-3.2.src/tools
$ svn co http://llvm.org/svn/llvm-project/cfe/tags/RELEASE_32/final/ clang
$ cd ../projects
$ svn co http://llvm.org/svn/llvm-project/compiler-rt/tags/RELEASE_32/final/ compiler-rt
$ cd ..
$ ./configure --with-pic --prefix=$HOME/.opt/
$ make -j4 REQUIRES_RTTI=1
$ make install
For some reason, we only managed to make Impala work by compiling it with Java 1.7 and then running it with Java 1.6.
Install Maven:
$ wget http://www.fightrice.com/mirrors/apache/maven/maven-3/3.0.4/binaries/apache-maven-3.0.4-bin.tar.gz
$ sudo tar xvf apache-maven-3.0.4-bin.tar.gz && sudo mv apache-maven-3.0.4 /usr/local
$ sudo ln -s /usr/local/apache-maven-3.0.4 /usr/local/maven
Edit the file /etc/profile and add the following lines:
export JAVA_HOME=/opt/java/default
export MVN_HOME=/usr/local/maven
export PATH=$JAVA_HOME/bin:$MVN_HOME/bin:$PATH
Build Impala:
$ . /etc/profile
$ export JAVA_HOME=/opt/java/jdk1.7.0_21
$ git clone http://github.com/cloudera/impala
$ cd impala
$ . bin/impala-config.sh
$ cd thirdparty
$ ./download_thirdparty.sh
$ cd ..
$ ./build_public.sh -build_thirdparty
Edit the file bin/set-classpath.sh and configure the CLASSPATH as follows:
CLASSPATH=\
$IMPALA_HOME/fe/src/test/resources:\
$IMPALA_HOME/fe/target/classes:\
$IMPALA_HOME/fe/target/dependency:\
$IMPALA_HOME/fe/target/test-classes:\
${HIVE_HOME}/lib/datanucleus-core-2.0.3.jar:\
${HIVE_HOME}/lib/datanucleus-enhancer-2.0.3.jar:\
${HIVE_HOME}/lib/datanucleus-rdbms-2.0.3.jar:\
${HIVE_HOME}/lib/datanucleus-connectionpool-2.0.3.jar:\
${HIVE_HOME}/conf:\
${HIVE_HOME}/lib/mysql-connector-java-5.1.25-bin.jar:\
/etc/hadoop/conf

Running the tests

Data Generator

We wrote a simple data generator that creates data inspired by advertising networks. It represents the number of impressions, the number of clicks and the revenue, broken out by advertiser, publisher and hour.
In terms of data distribution, we have one year of hourly data with 20 advertisers and 100 publishers, resulting in 160 million rows.
First get the source code:
$ git clone http://github.com/chimpler/blog-mysql-vertica-mongodb-impala
$ cd blog-mysql-vertica-mongodb-impala/datagen
$ make
$ ./gendata > /tmp/datatest.csv
Note that it will take about 15 minutes to generate the 160 million rows.
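You can peek at the generated file to check that it looks sane (we don’t reproduce the exact column layout here):
$ wc -l /tmp/datatest.csv
$ head -3 /tmp/datatest.csv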

Importing data in MySQL/MariaDB

$ sudo mysql -u root -p
MariaDB [(none)]> CREATE DATABASE test;
MariaDB [(none)]> GRANT ALL PRIVILEGES ON test.* TO test@localhost IDENTIFIED BY 'test';
MariaDB [(none)]> GRANT ALL PRIVILEGES ON test.* TO test@'%' IDENTIFIED BY 'test';
Then use our import script:
$ ./import_mysql.sh test test test
It will basically create a new table analytics and import the CSV data.
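If you prefer to do it by hand, the script boils down to something like the following. The column names come from the test queries shown later; the exact types are defined in the repository’s script, so treat this schema as an assumption:
$ mysql -u test -ptest test <<'EOF'
-- schema is a sketch based on the columns used by the test queries
CREATE TABLE IF NOT EXISTS analytics (
  ymdh          DATETIME,
  advertiser_id INT,
  publisher_id  INT,
  imps          BIGINT,
  clicks        BIGINT,
  revenue       DOUBLE
);
LOAD DATA LOCAL INFILE '/tmp/datatest.csv'
  INTO TABLE analytics
  FIELDS TERMINATED BY ',';
EOF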

Importing data in Vertica

To use the Vertica shell:
/opt/vertica/bin/vsql -d test -U dbadmin -w password
Useful commands:
  • \d: see all the tables
  • \d analytics: describe table analytics
  • \timing: toggle timings on/off (you can put it in your $HOME/.vsqlrc to enable it by default)
  • \q: quit
You can run the script import_vertica.sh that will take care of creating the table analytics and importing the CSV data:
$ ./import_vertica.sh dbadmin password test
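Under the hood the import essentially relies on Vertica’s COPY command; a rough hand-run equivalent (assuming the analytics table already exists) would look like:
$ vsql -d test -U dbadmin -w password -c \
    "COPY analytics FROM '/tmp/datatest.csv' DELIMITER ',' DIRECT;"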

Importing data in Hive

Some useful commands:
  • show tables;: show tables
  • desc <table_name>;: description of the table
  • quit;: quit
You can run Hive as follows to get more logs:
$ hive -hiveconf hive.root.logger=INFO,console
You can run the script import_hive.sh to import the data into Hive:
./import_hive.sh
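Roughly, the script amounts to creating a delimited table and loading the CSV into it, along these lines (the column types are assumptions; see the script for the exact DDL):
$ hive -e "
  CREATE TABLE IF NOT EXISTS analytics (
    ymdh STRING,
    advertiser_id INT,
    publisher_id INT,
    imps BIGINT,
    clicks BIGINT,
    revenue DOUBLE
  ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
  LOAD DATA LOCAL INPATH '/tmp/datatest.csv' INTO TABLE analytics;
"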

Importing data in Impala

Some useful commands:
  • show tables;: show tables
  • desc table_name;: description of the table
  • refresh: refresh schema from metastore
  • quit;: quit
The table analytics should be visible since we’re getting the metadata from the Hive metastore:
$ impala-shell
[localhost:21000] > refresh;
[localhost:21000] > show tables;
+-------------------+
| name              |
+-------------------+
| analytics         |
+-------------------+
We can query the table using Impala. However we can use Parquet to store the data in a columnar format to have better performance. Let’s create a Parquet table and import the data there:
$ ./import_impala_parquet.sh
That will create a new table analytics_parquet with the same data but using the Parquet format.
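If you want to do it manually, the idea (in Impala 1.0 syntax) is to create a Parquet-backed table and repopulate it from the CSV-backed one, roughly as follows:
$ impala-shell
[localhost:21000] > CREATE TABLE analytics_parquet LIKE analytics STORED AS PARQUETFILE;
[localhost:21000] > INSERT OVERWRITE TABLE analytics_parquet SELECT * FROM analytics;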

Some measures

For each database, we report the time to run the query the first time and the time to run it a second time (cached).
Data Set:
  • 160 million rows
  • 12 months, 28 days per month, 24 hours
  • 20 advertisers
  • 100 publishers
  • imps, clicks and revenue metrics
We ran the following tests:
  • Test 1: SELECT sum(imps) FROM analytics
  • Test 2: SELECT advertiser_id, AVG(clicks) FROM analytics WHERE ymdh BETWEEN '2012-06-01 00:00:00' AND '2012-07-01 00:00:00' GROUP BY advertiser_id
  • Test 3: SELECT AVG(1000 * revenue / imps) AS cpm FROM analytics WHERE ymdh > '2012-12-01 00:00:00' GROUP BY publisher_id ORDER BY cpm DESC LIMIT 10
We used the script test_all.sh, which runs the tests on all the different databases, on an AMD Phenom(tm) II X6 1045T processor (2.7 GHz) with 8GB of RAM.
We obtained the following response times:
Database         | Import Time                      | Indexing Time | Disk Space | Test 1        | Test 2      | Test 3
MariaDB          | 83m                              | 224m          | 21GB       | 115s / 0.1s   | 51s / 0.1s  | 47s / 0.1s
Vertica          | 15m                              | -             | 93GB       | 15.3s / 1.3s  | 2.7s / 0.5s | 5.16s / 3.25s
MongoDB          | 80m                              | 175m          | 59GB       | 305s / 305s   | 59s / 47s   | 59s / 52s
Hive             | 250s                             | -             | 59GB       | 252s / 225s   | 381s / 382s | 404s / 407s
Impala (CSV)     | -                                | -             | 59GB       | 131s / 119s   | 119s / 110s | 106s / 109s
Impala (Parquet) | 219s (from data already in HDFS) | -             | 2.8GB      | 12s / 6s      | 62s / 62s   | 51s / 53s
For each test, we measured the response time of the first run and of the second run to see whether any caching is at work. In our tests we can see that MySQL / MariaDB and Vertica use query caching.
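In the MySQL / MariaDB case you can confirm that the query cache is responsible by bypassing it with the SQL_NO_CACHE hint, for example:
$ mysql -u test -ptest test -e "SELECT SQL_NO_CACHE sum(imps) FROM analytics;"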
Note that Hive performance can be improved by compressing data in order to reduce I/O time. Check http://www.adaltas.com/blog/2012/03/13/hdfs-hive-storage-format-compression/ and http://www.saurabhnanda.com/2009/07/how-to-make-your-hive-cluster-blazingly.html for more information.

Conclusion

When looking strictly at the numbers, Vertica is a clear winner. MySQL, because of its row-based storage design, is not well suited for aggregations on millions of rows but still performs reasonably well. MongoDB, despite its new Aggregation Framework, is not suited for this kind of aggregation either. As expected, Hive performs poorly, but this can be explained by the Map/Reduce overhead. Finally, we can see that Impala is promising in the Hadoop world: using the raw CSV file format, it is much faster than Hive, and it performs quite well when using the columnar Parquet format. However, when filtering is involved, it does not perform very well.
In this post, we didn’t consider any partitioning of the data and we ran our queries on a single box, so we didn’t take advantage of parallel execution. We also didn’t compare how the databases degrade under load as the number of concurrent connections increases. Moreover, Impala went GA only a week ago, so we can expect more performance improvements in the months to come. Other Dremel-inspired implementations such as Hortonworks Stinger and EMC HAWQ boast very fast response times, so it will be interesting to compare them with Impala when they are publicly released.
