DevelopBI: Big Data

Saturday, February 1, 2014

Spoilt for Choice – How to choose the right Big Data / Hadoop Platform?

Big data is becoming a relevant topic in many companies this year. Although there is no standard definition of the term “big data”, Hadoop is the de facto standard for processing it. Almost all big software vendors, such as IBM, Oracle, SAP, and even Microsoft, use it. However, once you have decided to use Hadoop, the first question is how to start and which product to choose for your big data processes. Several alternatives exist for installing a version of Hadoop and implementing big data processes. This article discusses these alternatives and recommends when to use which one.

Hadoop related projects and frameworks

“Big-data” is one of the most inflated buzzwords of recent years. Technologies born to handle huge datasets and overcome the limits of previous products are gaining popularity outside the research environment. The following list is meant as a reference for this world. It’s still incomplete and always will be.

SQL is what’s next for Hadoop: Here’s who’s doing it.

SUMMARY:

More and more companies and open source projects are trying to let users run SQL queries from inside Hadoop itself. Here’s a list of what’s available and, on a high level, how they work.

Vertica loading best practices

Vertica loading best practices from Zvika Gutkin

Vertica mpp columnar dbms

Vertica mpp columnar dbms from Zvika Gutkin

Thursday, May 23, 2013

SQL, NoSQL, BigData in Data Architecture

All about how to build "Data Architecture" using SQL, NoSQL and BigData technologies and how to evaluate them.

SQL, NoSQL, BigData in Data Architecture from Venu Anuganti

Predictive Analytics is a Goldmine for Startups.

from Predictive Analytics book by Eric Siegel

Traditional business intelligence (and data mining) software does a very good job of showing you where you’ve been. By contrast, predictive analytics uses data patterns to make forward-looking predictions that guide you to where you should go next. This is a whole new world for startups seeking enterprise application opportunities, as well as social media trend challenges.

The Hadoop Distributed File System.

The Hadoop Distributed File System (HDFS) is designed to store very large data sets reliably, and to stream those data sets at high bandwidth to user applications. In a large cluster, thousands of servers both host directly attached storage and execute user application tasks. By distributing storage and computation across many servers, the resource can grow with demand while remaining economical at every size. We describe the architecture of HDFS and report on experience using HDFS to manage 40 petabytes of enterprise data at Yahoo!
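
To make the idea of distributing storage across many servers more concrete, here is a minimal Python sketch. It rests on the assumption that a file is cut into fixed-size blocks and that each block is copied to several nodes. The function names, the node names, and the round-robin placement are all hypothetical and chosen for illustration only; HDFS uses its own placement policy, and block size and replication factor are configurable there.

```python
def split_into_blocks(file_size, block_size):
    """Return the sizes of the blocks a file of file_size bytes is cut into."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    # Every block is full except possibly the last one.
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def place_replicas(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes.

    Round-robin placement is an illustrative assumption, not the policy
    HDFS actually uses.
    """
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            nodes[(block + offset) % len(nodes)] for offset in range(replication)
        ]
    return placement


if __name__ == "__main__":
    blocks = split_into_blocks(file_size=300, block_size=128)
    print(blocks)
    print(place_replicas(len(blocks), ["node1", "node2", "node3", "node4"], 3))
```

Because every block lives on more than one node, losing a single server does not lose data, and adding servers adds both capacity and places to run tasks close to the data.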

24 Interview Questions & Answers for Hadoop MapReduce developers

A good understanding of Hadoop architecture is required to understand and leverage the power of Hadoop. Here are a few important practical questions that may be asked of a senior, experienced Hadoop developer in an interview. This list primarily includes questions related to Hadoop architecture, MapReduce, the Hadoop API, and the Hadoop Distributed File System (HDFS).

Big Data Top Questions by Marketers and their Kids Infographic 2013

What are the top questions marketers ask about their Big Data, and how are they similar to their kids’ questions? Here’s a tongue-in-cheek look at the similarities. See the infographic below from Infochimps via Visual.ly.

Big Data Analytics with Hadoop

A good presentation; it is helpful for everyone from beginners to advanced users.

Big Data Analytics with Hadoop from Philippe Julio

Monday, April 29, 2013

Large-Scale Processing in Netezza.

Transitioning from ETL to ELT

CIO: Why is that uber-powered [commodity RDBMS] system running out of steam? Didn’t we just upgrade?

MANAGER: Yes, but the upgrade didn’t take.

CIO: Didn’t take? Sounds like a doctor transplanting an organ. Do you mean the CPUs rejected it? (laughing)

MANAGER: (soberly) No, just the users. Still too slow.

CIO: That hardware plant cost us [X] million dollars and it had better get it done or I’ll dismantle it for parts. I might dismantle your prima-donna architects with it!

Installing Hadoop on Ubuntu (12.04) - single node

--Installing Java

sudo add-apt-repository ppa:webupd8team/java

sudo apt-get update

sudo apt-get install oracle-java7-installer

--Creating a dedicated group and user for Hadoop

sudo addgroup hadoop

sudo adduser --ingroup hadoop hduser

Intro to Hadoop.

What is Hadoop?

Data is growing exponentially. What’s not so clear is how to unlock the value it holds. Hadoop is the answer. Hadoop is an open-source software framework that supports data-intensive distributed applications, licensed under the Apache v2 license. Hadoop is written in the Java programming language. Hadoop was derived from Google's MapReduce and Google File System (GFS) papers.

Google’s MapReduce provides:

Automatic parallelization and distribution
Fault-tolerance
I/O scheduling
Status and monitoring
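
The programming model behind the list above can be illustrated with a small, single-machine Python sketch of a word count. It is a sketch of the model only, not Hadoop's API: the function names are hypothetical, and a thread pool stands in for the cluster that would normally run the map tasks. Fault tolerance, I/O scheduling, and monitoring are not modeled here.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(mapped_chunks):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for chunk in mapped_chunks:
        for key, value in chunk:
            groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines, workers=4):
    # The user only writes the map and reduce functions; distributing the
    # map calls is the framework's job. A thread pool plays that role here.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_words, lines))
    groups = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in groups.items())


if __name__ == "__main__":
    sample = ["Hadoop stores data", "Hadoop processes data"]
    print(word_count(sample))
```

The point of the split is that `map_words` and `reduce_counts` hold no shared state, so the same two functions could run on one machine or on many without changes.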

Integrating Hadoop into Business Intelligence and Data Warehousing: An Overview in 27 Tweets.

To help you better understand how Hadoop can be integrated into business intelligence (BI) and data warehousing (DW) and why you should care, I’d like to share with you the series of 27 tweets I recently issued on the topic. I think you’ll find the tweets interesting, because they provide an overview of these issues and best practices in a form that’s compact, yet amazingly comprehensive.

Every tweet I wrote was a short sound bite or stat bite drawn from my recent TDWI report “Integrating Hadoop in Business Intelligence and Data Warehousing.” Many of the tweets focus on a statistic cited in the report, while other tweets are definitions stated in the report.

Hadoop Interview Question

1. What is the Hadoop framework?

Answer:

Hadoop is an open-source framework written in Java by the Apache Software Foundation. It is used to write software applications that need to process vast amounts of data (it can handle multiple terabytes). It works in parallel on large clusters that can have thousands of computers (nodes), and it processes data in a reliable, fault-tolerant manner.

2. On what concept does the Hadoop framework work?

Answer:

It works on MapReduce, a programming model devised by Google.

3. What is MapReduce?

Understanding Hadoop Clusters and the Network

This article is Part 1 in a series that will take a closer look at the architecture and methods of a Hadoop cluster, and how it relates to the network and server infrastructure. The content presented here is largely based on academic work and conversations I’ve had with customers running real production clusters. If you run production Hadoop clusters in your data center, I’m hoping you’ll provide your valuable insight in the comments below. Subsequent articles will cover the server and network architecture options in closer detail.

Preface

The amount of data stored in databases and files is growing every day. This creates a need to build cheaper, maintainable, and scalable environments capable of storing large amounts of data (“Big Data”). Conventional RDBMS systems have become too expensive and are not scalable enough for today’s needs, so it is time to use and develop new techniques that can satisfy them.

One of the technologies leading in this direction is cloud computing. There are different implementations of cloud computing, but we selected Hadoop, a MapReduce framework under the Apache license that is based on Google’s MapReduce framework.

Making Big Data and BI Work Together

For enterprise IT and the end-users it supports, the interplay between big data and B.I. can prove as exciting as it is frustrating.

As enterprise executives and end-users eagerly look to gain meaningful intelligence and fast time-to-insight from deep wells of rich data—enabling them to react more quickly and intelligently to market conditions, deliver better customer service, streamline internal operations, and differentiate the organization from among the competition—IT is charged with facilitating such desires for agility even as rivers of data continue to pour into the organization.

With storage costs low enough to store vast amounts of data cost-effectively, many IT organizations opt to store virtually everything they can. While that satisfies some end-user demands, it increases the pressure on the makers of B.I. tools to create offerings robust enough to make meaningful, quick, and accurate sense of all available data.