Sunday, April 28, 2013

Intro to Hadoop.


What is Hadoop?

Data is growing exponentially.  What’s not so clear is how to unlock the value it holds. Hadoop is the answer. Hadoop is an open-source software framework that supports data-intensive distributed applications, licensed under the Apache v2 license. Hadoop is written in the Java programming language and was derived from Google's MapReduce and Google File System (GFS) papers.
Google’s MapReduce provides:
  • Automatic parallelization and distribution
  • Fault-tolerance
  • I/O scheduling
  • Status and monitoring
Programming model for MapReduce:
Input & Output: each a set of key/value pairs
Programmer specifies two functions:
map (in_key, in_value) -> list(out_key, intermediate_value)
  • Processes input key/value pair
  • Produces set of intermediate pairs
reduce (out_key, list(intermediate_value)) -> list(out_value)
  • Combines all intermediate values for a particular key
  • Produces a set of merged output values (usually just one)
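
As a concrete sketch of this model, here is a minimal word-count example written against Hadoop's Java MapReduce API (the class names and the whitespace tokenization are illustrative choices, not anything mandated by Hadoop):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // map(in_key, in_value) -> list(out_key, intermediate_value)
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Process one input key/value pair (byte offset, line of text)
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);   // produce intermediate pair (word, 1)
        }
      }
    }
  }

  // reduce(out_key, list(intermediate_value)) -> list(out_value)
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();               // combine all intermediate values for this key
      }
      context.write(key, new IntWritable(sum));   // usually one output value per key
    }
  }
}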

So, Hadoop is designed as a distributed work manager for huge amounts of data on a large number of systems. But Hadoop is more than that: it also handles monitoring, failover, and scheduling.
On June 13, 2012 Facebook announced that they had the largest Hadoop cluster in the world, which had grown to 100 PB (petabytes).
Hadoop-related projects at Apache include:
  • HBase: A scalable, distributed database that supports structured data storage for large tables.
  • Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
  • Pig: A high-level data-flow language and execution framework for parallel computation.

What can Hadoop do for you?

According to analysts, about 80% of the data in the world is unstructured, and until Hadoop, it was essentially unusable in any systematic way. With Hadoop, for the first time you can combine all your data and look at it as one complete data set.

Hadoop can be used:
  •  to ingest the data flowing into your systems 24/7 and leverage it to make optimizations
  •  to make decisions based on hard data
  •  to look at complete data
  •  to look at years of transactions

Hadoop can handle all types of data from disparate systems: structured, unstructured, log files, pictures, audio files, communications records, email, just about anything you can think of.

Two Core Components of Hadoop:

Store:
HDFS (Hadoop Distributed File System): clustered storage, high bandwidth.
HDFS Data Storage: HDFS splits each file into 64 MB blocks (the default block size) and stores them on DataNodes.
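
A minimal sketch of writing a file into HDFS from Java: the NameNode address, path, replication factor, and buffer size below are placeholder values, and the 64 MB block size is passed explicitly only to mirror the description above.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder NameNode address.
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

    Path file = new Path("/user/demo/sample.txt");
    long blockSize = 64L * 1024 * 1024;   // 64 MB blocks, as described above
    short replication = 3;                // typical replication factor
    int bufferSize = 4096;

    // The file is transparently split into blocks and stored on DataNodes.
    FSDataOutputStream out = fs.create(file, true, bufferSize, replication, blockSize);
    out.writeUTF("hello hdfs\n");
    out.close();

    System.out.println("Block size: " + fs.getFileStatus(file).getBlockSize());
    fs.close();
  }
}

In practice the block size usually comes from the cluster-wide dfs.block.size / dfs.blocksize setting rather than being passed per file.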



Process:
MapReduce: distributed processing, fault-tolerant.
Map functions run on the same nodes where the data is stored, and each map receives its input as key/value pairs (bytes).
Each map output record is assigned to a reducer based on its key; the map output is grouped and sorted by key before it reaches the reducers.
The reducer's output is written through an output format such as TextOutputFormat.
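
Tying the pieces together, a driver along these lines configures the job. It reuses the hypothetical WordCount mapper and reducer from the earlier sketch, spells out the default HashPartitioner that assigns map output keys to reducers, and writes the final results with TextOutputFormat:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);

    job.setMapperClass(WordCount.TokenizerMapper.class);   // from the earlier sketch
    job.setReducerClass(WordCount.SumReducer.class);

    // Map output records are routed to reducers by key (a hash of the key by default),
    // then grouped and sorted by key before reduce() is called.
    job.setPartitionerClass(HashPartitioner.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);       // plain-text key/value output

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}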



Hadoop is not NoSQL:
  • The Hive project adds SQL-like support on top of it.
  • HiveQL compiles to a query plan.
  • The query plan executes as MapReduce jobs.
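
For illustration, a HiveQL query can be submitted from Java through Hive's JDBC driver, and Hive turns it into MapReduce jobs behind the scenes. The connection URL, credentials, and the page_views table below are placeholders (this uses the HiveServer2 driver; the older HiveServer uses a different driver class and URL prefix):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    // HiveServer2 JDBC driver; host, port, and database are placeholders.
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    Connection con = DriverManager.getConnection(
        "jdbc:hive2://hiveserver:10000/default", "user", "");

    Statement stmt = con.createStatement();

    // HiveQL is compiled into a query plan that runs as one or more MapReduce jobs.
    ResultSet rs = stmt.executeQuery(
        "SELECT page, COUNT(*) AS hits FROM page_views GROUP BY page");

    while (rs.next()) {
      System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
    }

    rs.close();
    stmt.close();
    con.close();
  }
}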






Data preparation and presentation
Data processing often splits into three separate tasks: data collection, data preparation, and data presentation.
The data preparation phase is often known as ETL (Extract Transform Load) or data factory.
The data presentation phase is usually referred to as the data warehouse.
Pig (combined with a workflow system such as Oozie) is best suited for the data factory, and Hive for the data warehouse.
Data factory use cases
In our data factory we have observed three distinct workloads: pipelines, iterative processing, and research.
Pipelines
Data goes through various cleaning steps, and these transformations are best managed by Pig.

Iterative Processing
In this case, there is usually one very large data set that is maintained. Typical processing on that data set involves bringing in small new pieces of data that will change the state of the large data set.
For example, consider a data set that contained all the news stories currently known to Yahoo! News. You can envision this as a huge graph, where each story is a node. In this graph, there are links between stories that reference the same events. Every few minutes a new set of stories comes in, and the tools need to add these to the graph, find related stories and create links, and remove old stories that these new stories supersede.
What sets this apart from the standard pipeline case is the constant inflow of small changes. These require an incremental processing model to process the data in a reasonable amount of time.
For example, if the process has already done a join against the graph of all news stories, and a small set of new stories arrives, re-running the join across the whole set will not be desirable. It will take hours or days.
Instead, joining against the new incremental data and using the results together with the results from the previous full join is the correct approach. This will take only a few minutes. Standard database operations can be implemented in this incremental way in Pig Latin, making Pig a good tool for this use case.
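
A rough sketch of that incremental pattern, with Pig Latin embedded in Java via PigServer (all paths, aliases, and schemas here are invented for illustration): join only the small batch of new stories against the graph, then union the result with the output of the previous full join.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class IncrementalJoinExample {
  public static void main(String[] args) throws Exception {
    PigServer pig = new PigServer(ExecType.MAPREDUCE);

    // Hypothetical inputs: the full story graph, the previously computed links,
    // and the small batch of newly arrived stories.
    pig.registerQuery("graph = LOAD '/data/news/graph' AS (story_id:chararray, event:chararray);");
    pig.registerQuery("new_stories = LOAD '/data/news/incoming' AS (story_id:chararray, event:chararray);");
    pig.registerQuery("previous = LOAD '/data/news/links_prev' AS (event:chararray, new_id:chararray, related_id:chararray);");

    // Join only the small, new data set against the graph...
    pig.registerQuery("delta = JOIN new_stories BY event, graph BY event;");
    pig.registerQuery("delta_links = FOREACH delta GENERATE new_stories::event AS event, "
        + "new_stories::story_id AS new_id, graph::story_id AS related_id;");

    // ...and combine the result with the output of the previous full join.
    pig.registerQuery("combined = UNION previous, delta_links;");

    pig.store("combined", "/data/news/links_current");
  }
}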
Research
A third use case is research. Many scientists use our grid tools to comb through the petabytes of data we have, and many of them want to quickly write a script to test a theory or gain deeper insight.
But, in the data factory, data may not be in a nice, standardized state yet. This makes Pig a good fit for this use case as well, since it supports data with partial or unknown schemas, and semi-structured or unstructured data.
Pig integration with streaming also makes it easy for researchers to take a Perl or Python script they have already debugged on a small data set and run it against a huge data set.
The widespread use of Pig has enabled the migration of our data factory processing to Hadoop. With the adoption of Hive, we will be able to move much of our data warehousing to Hadoop as well.
Having the data factory and the data warehouse on the same system will lower data-loading time into the warehouse — as soon as the factory is finished, it is available in the warehouse.
It will also enable us to share metadata, monitoring and management tools, support and operations teams, and hardware across both the factory and the warehouse.
HBase:
HBase is the Hadoop database. Think of it as a distributed, scalable, big data store.

When Would I Use HBase?
Use HBase when you need random, realtime read/write access to your Big Data. This project's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware. HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.
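
For instance, a random, realtime write followed by a read might look like this with the classic (pre-1.0) HTable client API; the table name, column family, and values are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
    HTable table = new HTable(conf, "users");           // hypothetical table

    // Random, realtime write: one row, one column family, one qualifier.
    Put put = new Put(Bytes.toBytes("row-42"));
    put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
    table.put(put);

    // Random, realtime read of the same cell.
    Get get = new Get(Bytes.toBytes("row-42"));
    Result result = table.get(get);
    byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
    System.out.println(Bytes.toString(value));

    table.close();
  }
}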

For Hive queries, response times for even the smallest jobs can be on the order of 5-10 minutes, and larger jobs may run into hours.
Since HBase and HyperTable are all about performance (being modeled on Google's Bigtable), they tend to be much faster than Hive, at the cost of functionality and a higher learning curve (e.g., they don't have joins or SQL-like syntax).
Some Hive notes:
1.  DROP TABLE: never gives an error, whether the table exists or not.
2.  A partition name does not mean that the partition contains all or only the data matching that name; partitions are named for convenience, but it is the user's job to guarantee the relationship between partition name and data content.
3.  Implicit conversion is allowed for types from a child to an ancestor in the following hierarchy:
      • DOUBLE
          • BIGINT
              • INT
                  • TINYINT
          • FLOAT
      • STRING
      • BOOLEAN

So when a query expression expects type1 and the data is of type2, type2 is implicitly converted to type1 if type1 is an ancestor of type2 in the type hierarchy above.
Apart from these fundamental rules for implicit conversion based on the type system, Hive also allows one special-case conversion: STRING to DOUBLE.
4.  Explicit conversion can be done using the CAST operator.
5.  The field terminator should be a single character (see the sketch after this list).
    a.  If more than one character is given, only the first character is used.
        In the case of ‘$|$’, the terminator will be ‘$’.
    b.  If a number is specified as the terminator, it is treated as the ASCII value of a character.
        In the case of ‘49’, the terminator will be ‘1’.
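
The terminator and cast rules above can be exercised with a couple of HiveQL statements, for example over the same kind of JDBC connection used earlier (the table names and columns here are invented):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveTerminatorAndCastExample {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    Connection con = DriverManager.getConnection(
        "jdbc:hive2://hiveserver:10000/default", "user", "");
    Statement stmt = con.createStatement();

    // Field terminator must be a single character ('$' here).
    stmt.execute(
        "CREATE TABLE IF NOT EXISTS sales (id INT, amount STRING) " +
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY '$' " +
        "STORED AS TEXTFILE");

    // Implicit conversion: STRING -> DOUBLE, so SUM() over the STRING column works.
    stmt.execute(
        "CREATE TABLE sales_totals AS " +
        "SELECT id, SUM(amount) AS total FROM sales GROUP BY id");

    // Explicit conversion with CAST.
    stmt.execute(
        "CREATE TABLE sales_ints AS " +
        "SELECT id, CAST(amount AS INT) AS amount_int FROM sales");

    stmt.close();
    con.close();
  }
}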



