DevelopBI: Hadoop related projects and frameworks

“Big data” is one of the most inflated buzzwords of recent years. Technologies born to handle huge datasets and overcome the limits of earlier products are gaining popularity outside the research environment. The following list is meant as a reference to this world. It is still incomplete and always will be.

Batch processing frameworks and related components

Apache Hadoop: framework for distributed processing. Integrates MapReduce (parallel processing), YARN (job scheduling) and HDFS (distributed file system)
Cascading: framework for data management/analytics on Hadoop
Apache Ambari: operational framework for Hadoop management
Apache Tez: application framework for executing a complex DAG (directed acyclic graph) of tasks, built on YARN
Apache Falcon: data management framework
Apache Avro: data serialization system
Apache Flume: service to manage large amounts of log data
Apache Sqoop: tool to transfer data between Hadoop and a structured datastore
Apache Thrift: framework to build binary protocols
Apache Hama: BSP (Bulk Synchronous Parallel) computing framework
OpenMPI: message passing framework
Apache Gora: framework for in-memory data model and persistence
Disco: MapReduce framework developed by Nokia
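
Several entries above refer to the MapReduce model. As a rough illustration only, here is a minimal single-process sketch of its three steps (map, shuffle, reduce) applied to a word count. The function names are hypothetical and none of this is the API of Hadoop, Disco or any other framework listed; a real framework distributes these same steps across a cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, the step a framework runs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence on an in-memory list of documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big clusters", "data pipelines"]))
```

Because each map call and each reduce call touches only its own record or key, both phases can run in parallel on separate machines, which is what the frameworks in this section take care of.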

Software and libraries for cluster management

Apache ZooKeeper: centralized service for configuration, naming and distributed synchronization
Apache Helix: cluster management framework
Apache Mesos: cluster manager
Apache Whirr: set of libraries for running cloud services
LinkedIn Norbert: cluster manager
LinkedIn White Elephant: log aggregator and dashboard
LinkedIn Kamikaze: utility package for compressing sorted integer arrays
Serf: decentralized solution for service discovery and orchestration
Apache Knox: single point of secure access for Hadoop clusters
Chronos: distributed and fault-tolerant scheduler
Hortonworks Hoya: application that can deploy HBase clusters on YARN

Data analysis platforms and languages

Apache Hive: SQL-like data warehouse system for Hadoop
Apache HCatalog: table and storage management layer for Hadoop
Apache Pig: high level language to express data analysis programs for Hadoop
Twitter Scalding: Scala library for MapReduce jobs, built on Cascading
Cascading Lingual: SQL-like query language for Cascading
Shark: data warehouse system for Spark
LinkedIn DataFu: collection of user-defined functions for Hadoop and Pig
Pivotal HAWQ: SQL-like data warehouse system for Hadoop
Hunk: Splunk analytics for Hadoop
Cascalog: data processing and querying library
Stinger: interactive query for Hive

Schedulers for Hadoop

Apache Oozie: workflow job scheduler
LinkedIn Azkaban: batch workflow job scheduler

Fast/Streaming big data processing

Apache Drill: framework for interactive analysis, inspired by Dremel
Google BigQuery: framework for interactive analysis, implementation of Dremel
Cloudera Impala: framework for interactive analysis, inspired by Dremel
Apache Spark: framework for in-memory cluster computing
Apache Spark Streaming: framework for stream processing, part of Spark
Apache S4: framework for stream processing, implementation of S4
Apache Samza: stream processing framework, based on Kafka and YARN
Druid: framework for real-time analysis of large datasets
Tachyon: reliable file sharing at memory speed across cluster frameworks
Apache Storm: framework for stream processing, by Twitter, also runs on YARN
LinkedIn Databus: stream of change capture events for a database
BlinkDB: massively parallel, approximate query engine
Splunk: analyzer for machine-generated data
Kiji Project: framework to collect and analyze data in real-time, based on HBase
Amazon Kinesis: real-time processing of streaming data at massive scale
Facebook Peregrine: MapReduce framework
Summingbird: Streaming MapReduce with Scalding and Storm, by Twitter

Large-scale graph processing

Apache Spark Bagel: implementation of Pregel, part of Spark
GraphX: A Resilient Distributed Graph System on Spark
Apache Giraph: implementation of Pregel, based on Hadoop
Phoebus: framework for large scale graph processing

Machine learning

Apache Mahout: machine learning library for Hadoop
Cascading Pattern: machine learning library for Cascading
PredictionIO: machine learning server built on Hadoop, Mahout and Cascading
MLbase: distributed machine learning libraries for the BDAS stack
Vowpal Wabbit: learning system sponsored by Microsoft and Yahoo!
H2O: statistical, machine learning and math runtime for Hadoop

Distributed column-oriented data store

Apache HBase: column-oriented distributed datastore, inspired by BigTable
Apache Cassandra: column-oriented distributed datastore, inspired by BigTable
HyperTable: column-oriented distributed datastore, inspired by BigTable
Google BigTable: column-oriented distributed datastore
Parquet: columnar storage format for Hadoop

Distributed key/value store

Apache Accumulo: distributed key/value store, built on Hadoop
LinkedIn Voldemort: distributed key/value storage system
Google App Engine Datastore: schemaless object datastore
Amazon DynamoDB: distributed key/value store, implementation of Dynamo
Storehaus: library to work with asynchronous key value stores, by Twitter
ElephantDB: distributed database specialized in exporting data from Hadoop

Distributed document-oriented data store

LinkedIn Espresso: horizontally scalable document-oriented NoSQL data store
Google App Engine Datastore: schemaless object datastore
Facebook Haystack: object storage system
jumboDB: document-oriented datastore over Hadoop

NewSQL platforms

Facebook PrestoDB: distributed SQL query engine
NuoDB: SQL/ACID compliant distributed database
Amazon RedShift: data warehouse service, based on PostgreSQL
FoundationDB: distributed database, inspired by F1
HadoopDB: hybrid of MapReduce and DBMS
InfiniSQL: infinitely scalable RDBMS

Distributed graph database

Twitter FlockDB: distributed graph database
Titan: distributed graph database, built over Cassandra

Data collection systems

Apache Chukwa: data collection system
Facebook Scribe: streamed log data aggregator
Fluentd: tool to collect events and logs

Memcached compatible caching systems

Facebook McDipper: key/value cache for flash storage
Facebook Memcached: fork of Memcache
Twitter Fatcache: key/value cache for flash storage
Twitter Twemcache: fork of Memcache

MySQL forks and evolutions

MySQL Cluster: MySQL implementation using NDB Cluster storage engine
MariaDB: enhanced, drop-in replacement for MySQL
Google Cloud SQL: MySQL databases in Google’s cloud
Amazon RDS: MySQL databases in Amazon’s cloud
Percona Server: enhanced, drop-in replacement for MySQL
ProxySQL: high-performance proxy for MySQL

Distributed queuing systems

Apache Kafka: distributed publish-subscribe messaging system
Kestrel: distributed message queue system

Search engines and frameworks

HBase Coprocessor: implementation of Percolator, part of HBase
Apache Lucene: search engine library
Apache Solr: search platform for Apache Lucene
ElasticSearch: search and analytics engine based on Apache Lucene
Sphinx: full-text search engine

Not yet public projects

Facebook Scuba: distributed in-memory datastore
Facebook Corona: Hadoop enhancement which removes single point of failure
Facebook Prism: multi datacenters replication system
Facebook Unicorn: social graph search platform
Google Megastore: scalable, highly available storage
Google MillWheel: fault tolerant stream processing framework
Google F1: distributed SQL database
Google Spanner: globally distributed database
Google GFS: distributed filesystem
Google Colossus: distributed filesystem (GFS2)
Google MapReduce: map reduce framework
Google Pregel: graph processing framework

Interesting papers 2001 – 2010

2003 - Google - The Google File System
2004 - Google - MapReduce: Simplified Data Processing on Large Clusters
2006 - Google - Bigtable: A Distributed Storage System for Structured Data
2006 - Google - The Chubby lock service for loosely-coupled distributed systems
2007 - Amazon - Dynamo: Amazon’s Highly Available Key-value Store
2007 - Google - Paxos Made Live: An Engineering Perspective (describes Chubby)
2008 - AMPLab - Chukwa: A large-scale monitoring system
2009 - HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads
2010 - Yahoo - S4: Distributed Stream Computing Platform
2010 - Google - Dremel: Interactive Analysis of Web-Scale Datasets
2010 - Google - Large-scale Incremental Processing Using Distributed Transactions and Notifications (base of Percolator and Caffeine)
2010 - Google - Pregel: A System for Large-Scale Graph Processing
2010 - Google - Storage Architecture and Challenges
2010 - AMPLab - Spark: Cluster Computing with Working Sets
2010 - Facebook - Finding a needle in Haystack: Facebook’s photo storage

Interesting papers 2011 – 2012

2011 - Google - Megastore: Providing Scalable, Highly Available Storage for Interactive Services
2011 - AMPLab - Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
2011 - AMPLab - Scarlett: Coping with Skewed Popularity Content in MapReduce Clusters
2012 - Google - Spanner: Google’s Globally-Distributed Database (also describes Colossus)
2012 - Google - Processing a trillion cells per mouse click (base of PowerDrill)
2013 - AMPLab - BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data
2012 - Microsoft - Paxos Made Parallel (base of Tribble)
2012 - Microsoft - Paxos Replicated State Machines as the Basis of a High-Performance Data Store
2012 - AMPLab - Shark: Fast Data Analysis Using Coarse-grained Distributed Memory
2012 - AMPLab - Fast and Interactive Analytics over Hadoop Data with Spark
2012 - AMPLab - Blink and It’s Done: Interactive Queries on Very Large Data

Interesting papers 2013 – present

2013 - Facebook - Scaling Memcache at Facebook
2013 - Facebook - Unicorn: A System for Searching the Social Graph
2013 - Facebook - Scuba: Diving into Data at Facebook
2013 - Google - MillWheel: Fault-Tolerant Stream Processing at Internet Scale
2013 - Google - F1: A Distributed SQL Database That Scales
2013 - Google - Online, Asynchronous Schema Change in F1
2013 - Metamarkets - Druid: A Real-time Analytical Data Store
2013 - Microsoft - Scalable Progressive Analytics on Big Data in the Cloud
2013 - Google - HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm
2013 - AMPLab - GraphX: A Resilient Distributed Graph System on Spark
2013 - AMPLab - Shark: SQL and Rich Analytics at Scale
2013 - AMPLab - MLbase: A Distributed Machine-learning System
2013 - AMPLab - Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices

source : http://blog.andreamostosi.name/big-data/

Saturday, February 1, 2014
