Tuesday, February 11, 2014

Mahout - Future Directions

Introduction

The goal of the Apache Mahout Machine Learning Library is to build scalable machine learning libraries. Mahout focuses primarily on Collaborative Filtering (Recommenders), Clustering, and Classification (known as the "3Cs"), as well as the infrastructure needed to support those implementations. This includes math packages for statistics, linear algebra, and other areas, as well as Java primitive collections, local and distributed vector and matrix classes, and a variety of integration code for working with popular packages such as Apache Hadoop, Apache Lucene, Apache HBase, Apache Cassandra, and more.
Future Releases

Saturday, February 1, 2014

Spoilt for Choice – How to choose the right Big Data / Hadoop Platform?

Big data is becoming a relevant topic in many companies this year. Although there is no standard definition of the term “big data”, Hadoop is the de facto standard for processing it. Almost all big software vendors, such as IBM, Oracle, SAP, and even Microsoft, use it. However, once you have decided to use Hadoop, the first question is how to start and which product to choose for your big data processes. Several alternatives exist for installing a version of Hadoop and implementing big data processes. This article discusses the alternatives and recommends when to use which one.

Using Apache Storm for real-time analytics at Rocket Lawyer

With today’s data technologies, storing data and scaling the infrastructure are becoming non-issues thanks to HDFS, Hadoop, and related architectures. Hadoop provides a batch-processing framework, MapReduce, for processing the data. However, batch processing brings high data read latency, which is a challenge for use cases such as real-time analytics, clickstream visualization, and machine learning. We needed a real-time system that processes our customer- and system-generated data as it arrives, so that we can make important business decisions quickly. At Rocket Lawyer, we have chosen Apache Storm to supplement our data platform with real-time processing capabilities.

Hadoop-related projects and frameworks

“Big data” is one of the most inflated buzzwords of recent years. Technologies born to handle huge datasets and overcome the limits of earlier products are gaining popularity outside the research environment. The following list is meant as a reference to this world. It is still incomplete and always will be.