DevelopBI: hadoop

Showing posts with label hadoop. Show all posts

Saturday, February 1, 2014

Spoilt for Choice – How to choose the right Big Data / Hadoop Platform?

Big data becomes a relevant topic in many companies this year. Although there is no standard definition of the term „big data“, Hadoop is the de facto standard for processing big data. Almost all big software vendors such as IBM, Oracle, SAP, or even Microsoft use it. However, when you have decided to use Hadoop, the first question is how to start and which product to choose for your big data processes. Several alternatives exist for installing a version of Hadoop and realizing big data processes. This article discusses different alternatives and recommends when to use which one.

Hadoop related projects and frameworks

“Big-data” is one of the most inflated buzzword of the last years. Technologies born to handle huge datasets and overcome limits of previous products are gaining popularity outside the research environment. The following list would be a reference of this world. It’s still incomplete and always will be.

SQL is what’s next for Hadoop: Here’s who’s doing it.

SUMMARY:

More and more companies and open source projects are trying to let users run SQL queries from inside Hadoop itself. Here’s a list of what’s available and, on a high level, how they work.

Football zero, Apache Pig hero – the essence from hundreds of posts from Apache Pig user mailing list.

I am big fan of football and I really like reading football news. Last week however, I definitely overdid reading it (because Poland played against England in the World Cup 2014 qualifying match). Hopefully, I did realize that it is not the best way to waste my time and today I decided that my next 7 days will be different. Instead, I will read posts from Apache Pig user mailing lists!

The idea is just to read post from the mailing list anytime I feel like reading about football. It means Football zero, Apache Pig hero for me this week ;)

Ganglia configuration for a small Hadoop cluster and some troubleshooting.

Ganglia is an open-source, scalable and distributed monitoring system for large clusters. It collects, aggregates and provides time-series views of tens of machine-related metrics such as CPU, memory, storage, network usage. You can see Ganglia in action at UC Berkeley Grid.

Ganglia is also a popular solution for monitoring Hadoop and HBase clusters, since Hadoop (and HBase) has built-in support for publishing its metrics to Ganglia. With Ganglia you may easily see the number of bytes written by a particular HDSF datanode over time, the block cache hit ratio for a given HBase region server, the total number of requests to the HBase cluster, time spent in garbage collection and many, many others.

The Hadoop Distributed File System.

The Hadoop Distributed File System (HDFS) is designed to store very large data sets reliably, and to stream those data sets at high bandwidth to user applications. In a large cluster, thousands of servers both host directly attached storage and execute user application tasks. By distributing storage and computation across many servers, the resource can grow with demand while remaining economical at every size. We describe the architecture of HDFS and report on experience using HDFS to manage 40 petabytes of enterprise data at Yahoo!

24 Interview Questions & Answers for Hadoop MapReduce developers

A good understanding of Hadoop Architecture is required to understand and leverage the power of Hadoop. Here are few important practical questions which can be asked to a Senior Experienced Hadoop Developer in an interview. This list primarily includes questions related to Hadoop Architecture, MapReduce, Hadoop API and Hadoop Distributed File System (HDFS).

Big Data Top Questions by Marketers and their Kids Infographic 2013

What are the top questions marketers ask about their Big Data and how are they similar to their kids’ questions? Here’s a tongue-in-cheek look at how their questions are similar. See the below infographic from Infochimps via Visual.ly.

Hadoop in comics.

Hadoop >> HDFS in comics.
very nice

Big Data Analytics with Hadoop

A good presentation, it is helpfull from level of beginers to advance...

Big Data Analytics with Hadoop from Philippe Julio

Sunday, April 28, 2013

Installing Hadoop on Ubuntu (12.04) - single node

--Installing Java

sudo add-apt-repository ppa:webupd8team/java

sudo apt-get update

sudo apt-get install oracle-java7-installer

--Creating user

$ sudo addgroup hadoop

$ sudo adduser --ingroup hadoop hduser

Intro to Hadoop.

What is hadoop

Data is growing exponentially. What’s not so clear is how to unlock the value it holds. Hadoop is the answer. Hadoop is an open-source software framework that supports data-intensive distributed applications, licensed under the Apache v2 license. Hadoop is written in the Java programming language. Hadoop was derived from Google's Map Reduce and Google File System (GFS) papers.

Google’s MapReduce provides:

Automatic parallelization and distribution
Fault-tolerance
I/O scheduling
Status and monitoring

Integrating Hadoop into Business Intelligence and Data Warehousing: An Overview in 27 Tweets.

To help you better understand how Hadoop can be integrated into business intelligence (BE) and data warehousing (DW) and why you should care, I’d like to share with you the series of 27 tweets I recently issued on the topic. I think you’ll find the tweets interesting, because they provide an overview of these issues and best practices in a form that’s compact, yet amazingly comprehensive.

Every tweet I wrote was a short sound bite or stat bite drawn from my recent TDWI report “Integrating Hadoop in Business Intelligence and Data Warehousing.” Many of the tweets focus on a statistic cited in the report, while other tweets are definitions stated in the report.

Hadoop Interview Question

1.What is Hadoop framework?

Answer:

Hadoop is a open source framework which is written in java by apache software foundation. This framework is used to write software application which requires to process vast amount of data (It could handle multi tera bytes of data). It works in-parallel on large clusters which could have 1000 of computers (Nodes) on the clusters. It also process data very reliably and fault-tolerant manner.

2.On What concept the Hadoop framework works?

Answer:

It works on MapReduce, and it is devised by the Google.

3.What is MapReduce ?

Understanding Hadoop Clusters and the Network

This article is Part 1 in series that will take a closer look at the architecture and methods of a Hadoop cluster, and how it relates to the network and server infrastructure. The content presented here is largely based on academic work and conversations I’ve had with customers running real production clusters. If you run production Hadoop clusters in your data center, I’m hoping you’ll provide your valuable insight in the comments below. Subsequent articles to this will cover the server and network architecture options in closer detail.

Preface

Amount of data stored in database/files is growing every day, using this fact there become a need to build cheaper, mainatenable and scalable environments capable of storing big amounts of data („Big Data“). Conventional RDBMS systems became too expensive and not scalable based on today’s needs, so it is time to use/develop new techinques that will be able to satisfy our needs.

One of the technologies that lead in these directions is Cloud computing. There are different implementation of Cloud computing but we selected Hadoop – MapReduce framework with Apache licence based on Google Map Reduce framework.

Saturday, February 1, 2014

Spoilt for Choice – How to choose the right Big Data / Hadoop Platform?

Hadoop related projects and frameworks

Saturday, May 25, 2013

SQL is what’s next for Hadoop: Here’s who’s doing it.

Wednesday, May 15, 2013