Wednesday, August 28, 2013

50 Open Source Replacements for Proprietary Business Intelligence Software.

In a recent Gartner survey, CIOs picked business intelligence and analytics as their top technology priority for 2012. The market research firm predicts that enterprises will spend more than $12 billion on business intelligence (BI), analytics and performance management software this year alone.

As the market for business intelligence solutions continues to grow, the open source community is responding with a growing number of applications designed to help companies store and analyze key business data. In fact, many of the best tools in the field are available under an open source license. And enterprises that need commercial support or other services will find many options available.

This month, we've put together a list of 50 of the top open source business intelligence tools that can replace proprietary solutions. It includes complete business intelligence platforms, data warehouses and databases, data mining and reporting tools, ERP suites with built-in BI capabilities and even spreadsheets. If we've overlooked any tools that you feel should be on the list, please feel free to note them in the comments section below.

Free BI Tools With Commercial Options

Business intelligence vendors provide software and services for data mining, data warehousing, reporting and enterprise resource planning. Some of the top vendors that offer free BI reporting tools alongside commercial services are MicroStrategy, QlikView, Pentaho, Jaspersoft (JasperReports), Rapid-I and Jedox.

Monday, June 17, 2013

Data Modeling: moving from SQL to NoSQL in the enterprise (lecture)

A very interesting lecture about data modeling and moving from MySQL to NoSQL.

Summary
Kenneth M. Anderson shares some of the data modeling issues encountered while transitioning from a relational database to NoSQL.
http://www.infoq.com/presentations/MySQL-NoSQL-Data-Modeling

Wednesday, June 5, 2013

Projections in Vertica

Projections... You probably have a good idea of what that means already. Who remembers Plato's cave from high school? It's basically a group of people locked in a cave, staring at a blank wall all the time. All they see on that wall are shadows of objects in the real world: projections, if you will. Plato argued that, for these prisoners, the projections are as close as it gets to reality. People who reason about reality, however, rather than just absorbing it, free themselves from the cave and can perceive reality as it really is, not just its projections.
In a relational database, you typically have tables containing your data and its relations. This is reality. If you want to see it from a particular angle, you can project your data into a view. A view might be a subset of the columns of a table, or a combination of some columns of one table with some columns of another. These things exist in Vertica as well, and they are called projections. But Vertica pushes the notion one step further: in Vertica there are no tables, only projections, and a collection of projections can represent a table, or multiple tables.
So Vertica's idea of a projection is really Plato's cave turned inside out. There is no reality, only a collection of projections from which we can create that reality when we need to. Sound familiar?
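To make that concrete, here is a minimal sketch (the table, columns and projection name are made up). Creating a table makes Vertica silently create a "superprojection" holding all of its columns; you can then add projections tuned for specific queries, and the optimizer picks whichever fits best.

-- A hypothetical fact table; Vertica auto-creates a superprojection for it.
CREATE TABLE sales (
    sale_date DATE,
    store_id  INT,
    amount    NUMERIC(10,2)
);

-- An extra projection, sorted for date-range scans and segmented
-- across the cluster by store; queries never name it directly.
CREATE PROJECTION sales_by_date AS
SELECT sale_date, store_id, amount
FROM sales
ORDER BY sale_date
SEGMENTED BY HASH(store_id) ALL NODES;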

Saturday, May 25, 2013

SQL is what’s next for Hadoop: Here’s who’s doing it.

SUMMARY:
More and more companies and open source projects are trying to let users run SQL queries from inside Hadoop itself. Here's a list of what's available and, at a high level, how they work.

Vertica loading best practices

Vertica: an MPP columnar DBMS

Installing and comparing MySQL/MariaDB, MongoDB, Vertica, Hive and Impala (Part 1)



A common thing a data analyst does in his day-to-day job is to run aggregations of data, generally by summing and averaging columns using different filters. When tables start to grow to hundreds of millions or billions of rows, these operations become extremely expensive and the choice of a database engine is crucial. Indeed, the more queries an analyst can run during the day, the better he can understand the data.

Thursday, May 23, 2013

SQL, NoSQL, Big Data in Data Architecture


All about how to build a "Data Architecture" using SQL, NoSQL and Big Data technologies, and how to evaluate them.




Predictive Analytics is a Goldmine for Startups.



From the book Predictive Analytics by Eric Siegel:
Traditional business intelligence (and data mining) software does a very good job of showing you where you’ve been. By contrast, predictive analytics uses data patterns to make forward-looking predictions that guide you to where you should go next. This is a whole new world for startups seeking enterprise application opportunities, as well as social media trend challenges.

Wednesday, May 22, 2013

Intro to NoSQL


What is NoSQL?


Relational databases were introduced in the 1970s to allow applications to store data through a standard data modeling and query language (Structured Query Language, or SQL). At the time, storage was expensive and data schemas were fairly simple and straightforward. Since the rise of the web, the volume of data stored about users, objects, products and events has exploded. Data is also accessed more frequently, and is processed more intensively – for example, social networks create hundreds of millions of customized, real-time activity feeds for users based on their connections' activities.

Thursday, May 16, 2013

JAVA: Reading and writing text files.

When reading and writing text files (a short example follows this list):
  • it's often a good idea to use buffering (the default buffer size is 8K)
  • it's often possible to use references to abstract base classes, instead of references to specific concrete classes
  • there's always a need to pay attention to exceptions (in particular, IOException and FileNotFoundException)
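A minimal sketch tying the three points together (the file names are placeholders):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

public class CopyTextFile {
    public static void main(String[] args) throws IOException {
        // Reader/Writer would be the abstract base classes to hold these
        // references; the concrete FileReader/FileWriter appear only at
        // construction time. Buffering avoids a system call per character.
        try (BufferedReader in = new BufferedReader(new FileReader("input.txt"));
             BufferedWriter out = new BufferedWriter(new FileWriter("output.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                out.write(line);
                out.newLine();
            }
        } // IOException (including FileNotFoundException) propagates to the caller
    }
}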

Wednesday, May 15, 2013

Football zero, Apache Pig hero: the essence of hundreds of posts from the Apache Pig user mailing list.


I am a big fan of football and I really like reading football news. Last week, however, I definitely overdid it (because Poland played against England in a World Cup 2014 qualifying match). Thankfully, I realized that this is not the best way to spend my time, so today I decided that my next 7 days will be different: I will read posts from the Apache Pig user mailing list instead!
The idea is simply to read posts from the mailing list anytime I feel like reading about football. It means Football zero, Apache Pig hero for me this week ;)

Ganglia configuration for a small Hadoop cluster and some troubleshooting.


Ganglia is an open-source, scalable and distributed monitoring system for large clusters. It collects, aggregates and provides time-series views of tens of machine-related metrics such as CPU, memory, storage and network usage. You can see Ganglia in action at UC Berkeley Grid.
Ganglia is also a popular solution for monitoring Hadoop and HBase clusters, since Hadoop (and HBase) has built-in support for publishing its metrics to Ganglia. With Ganglia you may easily see the number of bytes written by a particular HDFS datanode over time, the block cache hit ratio for a given HBase region server, the total number of requests to the HBase cluster, time spent in garbage collection and many, many others.
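For a flavor of the Hadoop side of that setup, here is a sketch of the relevant lines in hadoop-metrics2.properties (the host name and period below are placeholders to adjust for your cluster; the sink class shown matches Ganglia 3.1 and later):

# Send Hadoop metrics to Ganglia every 10 seconds
*.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
*.sink.ganglia.period=10
namenode.sink.ganglia.servers=gmetad.example.com:8649
datanode.sink.ganglia.servers=gmetad.example.com:8649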

The Hadoop Distributed File System.


The Hadoop Distributed File System (HDFS) is designed to store very large data sets reliably, and to stream those data sets at high bandwidth to user applications. In a large cluster, thousands of servers both host directly attached storage and execute user application tasks. By distributing storage and computation across many servers, the resource can grow with demand while remaining economical at every size. We describe the architecture of HDFS and report on experience using HDFS to manage 40 petabytes of enterprise data at Yahoo!

Saturday, May 4, 2013

24 Interview Questions & Answers for Hadoop MapReduce developers

A good understanding of Hadoop architecture is required to understand and leverage the power of Hadoop. Here are a few important practical questions that may be asked of a senior, experienced Hadoop developer in an interview. The list primarily includes questions related to Hadoop architecture, MapReduce, the Hadoop API and the Hadoop Distributed File System (HDFS).

Big Data Top Questions by Marketers and their Kids Infographic 2013

What are the top questions marketers ask about their Big Data, and how do they resemble their kids’ questions? Here’s a tongue-in-cheek comparison in the infographic below, from Infochimps via Visual.ly.

Hadoop in comics.

Hadoop and HDFS explained in comics.
Very nice.

Tuesday, April 30, 2013

Big Data Analytics with Hadoop

A good presentation; it is helpful for everyone from beginner to advanced level.


SAP Business Objects Data Services (BODS) Interview Questions with Answers

Learn the answers to some critical questions commonly asked during SAP BO Data Services interviews.
1. What is the use of BusinessObjects Data Services?
Answer:
BusinessObjects Data Services provides a graphical interface that allows you to easily create jobs that extract data from heterogeneous sources, transform that data to meet the business requirements of your organization, and load the data into a single location.

Monday, April 29, 2013

Time series analytics on Vertica


Gap Filling and Interpolation (GFI)

A Swiss-Army Knife for Time Series Analytics

Gap Filling and Interpolation (GFI) is a set of patent-pending time series analytics features in Vertica. In this post, through additional use cases, we will show that GFI can enable Vertica users in a wide range of industry sectors to achieve a diverse set of goals.
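For a flavor of what GFI looks like in SQL, here is a minimal sketch using Vertica's TIMESERIES clause (the table and columns are hypothetical): it buckets an irregular stream of stock ticks into regular one-second slices and interpolates the bid price in the gaps.

-- Hypothetical tick data: ticks(symbol VARCHAR, ts TIMESTAMP, bid FLOAT)
-- One output row per symbol per second; where no tick fell inside a
-- slice, TS_FIRST_VALUE fills the gap by linear interpolation.
SELECT symbol,
       slice_time,
       TS_FIRST_VALUE(bid, 'linear') AS interpolated_bid
FROM ticks
TIMESERIES slice_time AS '1 second' OVER (PARTITION BY symbol ORDER BY ts);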

Rolling Average with Oracle or Vertica analytical functions.


This little example demonstrates how to use Oracle's or Vertica's analytic functions to get a rolling average. First you have to create and load a table that contains each month's average temperature in Edinburgh in the years 1764-1820.
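A sketch of the kind of query the example builds up to, assuming a hypothetical table monthly_temp(yr, mon, avg_temp); the same syntax works in both Oracle and Vertica:

-- 12-month rolling average: each month averaged with the 11 months before it.
SELECT yr,
       mon,
       AVG(avg_temp) OVER (
           ORDER BY yr, mon
           ROWS BETWEEN 11 PRECEDING AND CURRENT ROW
       ) AS rolling_avg_12m
FROM monthly_temp
ORDER BY yr, mon;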

Large-Scale Processing in Netezza.


Transitioning from ETL to ELT

CIO: Why is that uber-powered [commodity RDBMS] system running out of steam? Didn’t we just upgrade?
MANAGER: Yes, but the upgrade didn’t take.
CIO: Didn’t take? Sounds like a doctor transplanting an organ. Do you mean the CPUs rejected it? (laughing)
MANAGER: (soberly) No, just the users. Still too slow.
CIO: That hardware plant cost us [X] million dollars and it had better get it done or I’ll dismantle it for parts. I might dismantle your prima-donna architects with it!

Enhanced Aggregation, Cube, Grouping and Rollup.


(OLAP reporting embedded in SQL)


Much of the OLAP reporting functionality embedded in Oracle SQL is ignored. People turn to expensive OLAP reporting tools on the market, even for simple reporting needs. This article outlines some common OLAP reporting needs and shows how to meet them by using the enhanced aggregation features of Oracle SQL.
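As a taste of the feature, here is a minimal sketch against a hypothetical sales(region, product, amount) table; ROLLUP computes the detail rows, the per-region subtotals and the grand total in a single pass:

-- GROUPING() returns 1 on the generated subtotal rows, so reports can
-- label them; NULL region/product values mark the rolled-up levels.
SELECT region,
       product,
       SUM(amount) AS total_amount,
       GROUPING(product) AS is_subtotal_row
FROM sales
GROUP BY ROLLUP (region, product)
ORDER BY region, product;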

Sunday, April 28, 2013

Analytic functions by Example.


This article builds a clear, thorough understanding of analytic functions and their various options through a series of simple yet concept-building examples. It is intended for SQL coders who might not be using analytic functions due to unfamiliarity with their cryptic syntax or uncertainty about their logic of operation. Often I see people reinvent the features provided by analytic functions with native joins and sub-query SQL. The article assumes familiarity with basic Oracle SQL, sub-queries, joins and group functions. Based on that familiarity, it builds up the concept of analytic functions example by example.
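To illustrate the point about reinventing the feature with joins and sub-queries, here is a minimal sketch on a hypothetical emp(empno, deptno, sal) table: the top earner per department, which would otherwise need a correlated sub-query, becomes a single pass with RANK():

SELECT empno, deptno, sal
FROM (
    -- rank salaries within each department, highest first
    SELECT empno, deptno, sal,
           RANK() OVER (PARTITION BY deptno ORDER BY sal DESC) AS sal_rank
    FROM emp
)
WHERE sal_rank = 1;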

Installing Hadoop on Ubuntu (12.04) - single node


# Install Java
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java7-installer

# Create a dedicated Hadoop user and group
sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser

Intro to Hadoop.


What is Hadoop?

Data is growing exponentially. What’s not so clear is how to unlock the value it holds. Hadoop is the answer. Hadoop is an open-source software framework that supports data-intensive distributed applications, licensed under the Apache v2 license. It is written in the Java programming language and was derived from Google's MapReduce and Google File System (GFS) papers.
Google’s MapReduce provides (a minimal word-count sketch follows this list):
  • Automatic parallelization and distribution
  • Fault-tolerance
  • I/O scheduling
  • Status and monitoring
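A hedged sketch of that model against the classic Hadoop Java API (the standard word-count example, abbreviated): the mapper emits (word, 1) pairs and the reducer sums them, while the framework supplies the parallelization, fault tolerance, scheduling and monitoring listed above.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit (word, 1) for every token; the framework partitions,
        // sorts and groups the pairs before they reach the reducer.
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}

class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get(); // retries and scheduling are the framework's job
        }
        context.write(key, new IntWritable(sum));
    }
}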

Predictive Analytics with Data Mining: How It Works.



by Eric Siegel, Ph.D.
Published in DM Review's DM Direct, February 2005.
Although you've probably heard many times that predictive analytics will optimize your marketing campaigns, it's hard to envision, in more concrete terms, what it will do. This makes it tough to select and direct analytics technology. How can you get a handle on its functional value for marketing, sales and product directions without necessarily becoming an expert?

Seven Principles for Enterprise Data Warehouse Design


In previous columns, I've talked about how you can improve the likelihood of achieving your desired results in building a data management center of excellence and in managing enterprise information. This month, I'd like to narrow the focus to one particular aspect of the enterprise information management spectrum: data warehouse (DW) design.
Contrary to popular sentiment, data warehousing is not a moribund technology; it's alive and kicking. Indeed, most companies deploy data warehousing technology to some extent, and many have an enterprise-wide DW.

Wednesday, April 24, 2013

Integrating Hadoop into Business Intelligence and Data Warehousing: An Overview in 27 Tweets.


To help you better understand how Hadoop can be integrated into business intelligence (BI) and data warehousing (DW) and why you should care, I’d like to share with you the series of 27 tweets I recently issued on the topic. I think you’ll find the tweets interesting, because they provide an overview of these issues and best practices in a form that’s compact, yet amazingly comprehensive.

Every tweet I wrote was a short sound bite or stat bite drawn from my recent TDWI report “Integrating Hadoop in Business Intelligence and Data Warehousing.” Many of the tweets focus on a statistic cited in the report, while other tweets are definitions stated in the report.

Monday, April 22, 2013

Hadoop Interview Questions


1. What is the Hadoop framework?
Answer:
Hadoop is an open source framework written in Java by the Apache Software Foundation. It is used to write software applications that need to process vast amounts of data (it can handle multiple terabytes), and it works in parallel on large clusters that can contain thousands of computers (nodes). It also processes data in a very reliable and fault-tolerant manner.
2. On what concept does the Hadoop framework work?
Answer:
It works on MapReduce, which was devised by Google.
3. What is MapReduce?

Understanding Hadoop Clusters and the Network


This article is Part 1 in a series that takes a closer look at the architecture and methods of a Hadoop cluster, and how it relates to the network and server infrastructure.  The content presented here is largely based on academic work and conversations I’ve had with customers running real production clusters.  If you run production Hadoop clusters in your data center, I’m hoping you’ll provide your valuable insight in the comments below.  Subsequent articles will cover the server and network architecture options in closer detail.

How To Build Optimal Hadoop Cluster ( Hadoop recommendations)


Preface

The amount of data stored in databases and files is growing every day, which creates a need for cheaper, maintainable and scalable environments capable of storing big amounts of data („Big Data“). Conventional RDBMS systems have become too expensive and insufficiently scalable for today’s needs, so it is time to use and develop new techniques that can satisfy them.
One of the technologies leading in this direction is cloud computing. There are different implementations of cloud computing, but we selected Hadoop, an Apache-licensed MapReduce framework based on Google's MapReduce framework.

Sunday, April 21, 2013

Making Big Data and BI Work Together


For enterprise IT and the end-users it supports, the interplay between big data and BI can prove as exciting as it is frustrating.

Enterprise executives and end-users eagerly look to gain meaningful intelligence and fast time-to-insight from deep wells of rich data, enabling them to react more quickly and intelligently to market conditions, deliver better customer service, streamline internal operations, and differentiate the organization from the competition. IT is charged with facilitating this agility even as rivers of data continue to pour into the organization.
With storage costs low enough to easily and cost-effectively store vast amounts of data, many IT organizations opt to store virtually everything they can. While that satisfies some of what end-users demand, it increases the pressure on the makers of BI tools to create offerings robust enough to make meaningful, quick, and accurate sense of all available data.

DWH Concepts and Fundamentals


Data Warehouse

As per Bill Inmon, "A warehouse is a historical, subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process".
By Historical we mean the data is continuously collected from sources and loaded into the warehouse. The previously loaded data is not deleted for a long period of time. This results in building historical data in the warehouse.
By Subject Oriented we mean data grouped into a particular business area instead of the business as a whole.
By Integrated we mean, collecting and merging data from various sources. These sources could be disparate in nature.
By Time-variant we mean that all data in the data warehouse is identified with a particular time period.
By Non-volatile we mean, data that is loaded in the warehouse is based on business transactions in the past, hence it is not expected to change over time.

Data Warehousing Websites



Ralph Kimball Associates
Ralph Kimball Associates focuses on developing, teaching, and delivering dimensional data warehouse design techniques for the community of IT professionals.

The OLAP Report
The OLAP Report website is a vendor-independent, research-based source of information regarding analytical processing of information. It provides detailed, unbiased and regularly updated information on the OLAP market and OLAP products.

The Data Warehousing Institute
The Data Warehousing Institute (TDWI) is the premier provider of in-depth, high quality education and training in the data warehousing and business intelligence industry.

DM Review
The DM Review website is an excellent data warehousing resource focused on business intelligence.

Business Intelligence Network
The B-EYE-Network serves the business intelligence and data warehousing community with unparalleled industry coverage and resources. In response to the growing need for a more sophisticated online resource, the B-EYE-Network delivers industry-based content hosted by domain experts and includes horizontal technology coverage from the most respected thought leaders in the BI and DW industry.