Saturday, May 25, 2013

SQL is what’s next for Hadoop: Here’s who’s doing it.

SUMMARY:
More and more companies and open source projects are trying to let users run SQL queries from inside Hadoop itself. Here’s a list of what’s available and, at a high level, how they work.
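To give a flavor of what running SQL inside Hadoop looks like in practice, here is a minimal sketch that submits a query to Hive through the HiveServer2 JDBC driver. The host name, credentials and the weblogs table are placeholders for illustration, not details from any of the projects listed.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; ships with the Hive distribution.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Placeholder host; 10000 is HiveServer2's default port.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hive-server:10000/default", "analyst", "");
             Statement stmt = conn.createStatement();
             // Hypothetical table; Hive compiles the query into MapReduce jobs.
             ResultSet rs = stmt.executeQuery(
                 "SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}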

Vertica loading best practices

Vertica: an MPP columnar DBMS

Installing and comparing MySQL/MariaDB, MongoDB, Vertica, Hive and Impala (Part 1)



A common task in a data analyst's day-to-day job is to run aggregations of data, generally by summing and averaging columns under different filters. When tables grow to hundreds of millions or billions of rows, these operations become extremely expensive and the choice of a database engine is crucial. Indeed, the more queries an analyst can run during the day, the better he can understand the data.
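As a concrete example of the workload being compared here, the sketch below runs one such filtered SUM/AVG aggregation over JDBC against MySQL. The connection URL, credentials and the sales table are made up for illustration; pointing the same SQL at MariaDB, Vertica, Hive or Impala is mostly a matter of swapping the driver and URL.

import java.sql.Connection;
import java.sql.Date;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class AggregationExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical database and schema, for illustration only.
        String url = "jdbc:mysql://localhost:3306/analytics";
        String sql = "SELECT region, SUM(amount) AS total, AVG(amount) AS mean "
                   + "FROM sales WHERE sale_date >= ? GROUP BY region";

        try (Connection conn = DriverManager.getConnection(url, "analyst", "secret");
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setDate(1, Date.valueOf("2013-01-01")); // the filter condition
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.printf("%s total=%.2f mean=%.2f%n",
                            rs.getString("region"),
                            rs.getDouble("total"),
                            rs.getDouble("mean"));
                }
            }
        }
    }
}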

Thursday, May 23, 2013

SQL, NoSQL, BigData in Data Architecture


All about how to build "Data Architecture" using SQL, NoSQL and BigData technologies and how to evaluate them.




Predictive Analytics is a Goldmine for Startups.



from the book Predictive Analytics by Eric Siegel
Traditional business intelligence (and data mining) software does a very good job of showing you where you’ve been. By contrast, predictive analytics uses data patterns to make forward-looking predictions that guide you to where you should go next. This is a whole new world for startups seeking enterprise application opportunities, as well as social media trend challenges.

Wednesday, May 22, 2013

Intro to NoSQL


What is NoSQL?


Relational databases were introduced in the 1970s to allow applications to store data through a standard data modeling and query language (Structured Query Language, or SQL). At the time, storage was expensive and data schemas were fairly simple and straightforward. Since the rise of the web, the volume of data stored about users, objects, products and events has exploded. Data is also accessed more frequently, and is processed more intensively – for example, social networks create hundreds of millions of customized, real-time activity feeds for users based on their connections' activities.

Thursday, May 16, 2013

JAVA: Reading and writing text files.

When reading and writing text files (a short sketch follows this list):
  • it's often a good idea to use buffering (the default buffer size is 8K)
  • it's often possible to use references to abstract base classes, instead of references to specific concrete classes
  • there's always a need to pay attention to exceptions (in particular, IOException and FileNotFoundException)
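Here is a minimal sketch that puts those three points together by copying one text file to another; the file names are placeholders. The streams are buffered, the helper method is typed against the abstract Reader/Writer classes, and the catch clause handles IOException (of which FileNotFoundException is a subclass).

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;

public class CopyTextFile {

    // Callers pass the abstract Reader/Writer types, not concrete classes.
    static void copy(Reader in, Writer out) throws IOException {
        BufferedReader reader = new BufferedReader(in);  // 8K buffer by default
        BufferedWriter writer = new BufferedWriter(out);
        String line;
        while ((line = reader.readLine()) != null) {
            writer.write(line);
            writer.newLine();
        }
        writer.flush();
    }

    public static void main(String[] args) {
        // try-with-resources closes both files even if copy() throws.
        try (Reader in = new FileReader("input.txt");
             Writer out = new FileWriter("output.txt")) {
            copy(in, out);
        } catch (IOException e) { // also catches FileNotFoundException
            e.printStackTrace();
        }
    }
}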

Wednesday, May 15, 2013

Football zero, Apache Pig hero – the essence of hundreds of posts from the Apache Pig user mailing list.


I am a big fan of football and I really like reading football news. Last week, however, I definitely overdid it (because Poland played against England in a World Cup 2014 qualifying match). Fortunately, I realized that this is not the best way to spend my time, and today I decided that my next 7 days will be different. Instead, I will read posts from the Apache Pig user mailing list!
The idea is simply to read a post from the mailing list anytime I feel like reading about football. It means Football zero, Apache Pig hero for me this week ;)

Ganglia configuration for a small Hadoop cluster and some troubleshooting.


Ganglia is an open-source, scalable and distributed monitoring system for large clusters. It collects, aggregates and provides time-series views of tens of machine-related metrics such as CPU, memory, storage and network usage. You can see Ganglia in action at the UC Berkeley Grid.
Ganglia is also a popular solution for monitoring Hadoop and HBase clusters, since Hadoop (and HBase) has built-in support for publishing its metrics to Ganglia. With Ganglia you may easily see the number of bytes written by a particular HDFS datanode over time, the block cache hit ratio for a given HBase region server, the total number of requests to the HBase cluster, time spent in garbage collection and many, many others.
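As a sketch of how that built-in support is typically wired up through Hadoop's metrics2 system (assuming the Ganglia 3.1 wire protocol; the host name below is a placeholder and 8649 is Ganglia's default port), hadoop-metrics2.properties on each node might look like this:

# Publish all metric sources through the Ganglia 3.1-protocol sink every 10s.
*.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
*.sink.ganglia.period=10

# Hypothetical gmond host; the prefix selects which daemon's metrics to send.
namenode.sink.ganglia.servers=ganglia-host:8649
datanode.sink.ganglia.servers=ganglia-host:8649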

The Hadoop Distributed File System.


The Hadoop Distributed File System (HDFS) is designed to store very large data sets reliably, and to stream those data sets at high bandwidth to user applications. In a large cluster, thousands of servers both host directly attached storage and execute user application tasks. By distributing storage and computation across many servers, the resource can grow with demand while remaining economical at every size. We describe the architecture of HDFS and report on experience using HDFS to manage 40 petabytes of enterprise data at Yahoo!
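To make that concrete, here is a minimal sketch of writing and then reading a small file through the HDFS Java API; the NameNode URI and the path are placeholders for your own cluster.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        // Placeholder NameNode address; 8020 is a common default RPC port.
        FileSystem fs = FileSystem.get(
                URI.create("hdfs://namenode:8020"), new Configuration());
        Path file = new Path("/user/demo/hello.txt");

        // Write: the client streams the data to a pipeline of datanodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello, HDFS");
        }

        // Read: the client asks the namenode for block locations,
        // then reads directly from the datanodes.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
    }
}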

Saturday, May 4, 2013

24 Interview Questions & Answers for Hadoop MapReduce developers

A good understanding of Hadoop architecture is required to understand and leverage the power of Hadoop. Here are a few important practical questions that can be asked of a senior, experienced Hadoop developer in an interview. This list primarily includes questions related to Hadoop architecture, MapReduce, the Hadoop API and the Hadoop Distributed File System (HDFS).
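Many of those questions come down to the structure of a basic MapReduce job, so as a refresher, here is a sketch of the canonical WordCount in the org.apache.hadoop.mapreduce API – something to study before the interview, not one of the 24 questions themselves.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every token in the input line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}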

Big Data Top Questions by Marketers and their Kids Infographic 2013

What are the top questions marketers ask about their Big Data, and how similar are they to their kids’ questions? Here’s a tongue-in-cheek comparison in the infographic below, from Infochimps via Visual.ly.

Hadoop in comics.

A comics-style introduction to Hadoop and its HDFS. Very nice!