Saturday, February 1, 2014

Hadoop related projects and frameworks

“Big data” is one of the most inflated buzzwords of recent years. Technologies born to handle huge datasets and overcome the limits of previous products are gaining popularity outside the research environment. The following list is meant as a reference to this world. It’s still incomplete and always will be.
Batch processing framework and related components
  • Apache Hadoop: framework for distributed processing. Integrates MapReduce (parallel processing), YARN (job scheduling) and HDFS (distributed file system)
  • Cascading: framework for data management/analytics on Hadoop
  • Apache Ambari: operational framework for Hadoop management
  • Apache Tez: application framework for executing a complex DAG (directed acyclic graph) of tasks, built on YARN
  • Apache Falcon: data management framework
  • Apache Avro: data serialization system
  • Apache Flume: service to manage large amounts of log data
  • Apache Sqoop: tool to transfer data between Hadoop and a structured datastore
  • Apache Thrift: framework to build binary protocols
  • Apache Hama: BSP (Bulk Synchronous Parallel) computing framework
  • OpenMPI: message passing framework
  • Apache Gora: framework for in-memory data model and persistence
  • Disco: MapReduce framework developed by Nokia
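
Several entries above build on the MapReduce model. As a rough illustration only, here is a single-process sketch of its three phases (map, shuffle, reduce) applied to word counting. It uses no Hadoop or Disco API; the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are made up for this example, and a real framework would run the phases in parallel across a cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce sequentially in one process."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input: two short "documents".
counts = word_count(["Hadoop stores data", "Hadoop processes data"])
print(counts)
```

In an actual framework the map and reduce functions stay about this small; the distribution of input splits, the shuffle over the network and the recovery from failed tasks are what the framework takes care of.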
Software and libraries for cluster management
Data analysis platform and languages
  • Apache Hive: SQL-like data warehouse system for Hadoop
  • Apache HCatalog: table and storage management layer for Hadoop
  • Apache Pig: high level language to express data analysis programs for Hadoop
  • Twitter Scalding: Scala library for Map Reduce jobs, built on Cascading
  • Cascading Lingual: SQL-like query language for Cascading
  • Shark: data warehouse system for Spark
  • LinkedIn DataFu: collection of user-defined functions for Hadoop and Pig
  • Pivotal HAWQ: SQL-like data warehouse system for Hadoop
  • Hunk: Splunk analytics for Hadoop
  • Cascalog: data processing and querying library
  • Stinger: interactive query for Hive
Scheduler for Hadoop
Fast/Streaming big data processing
  • Apache Drill: framework for interactive analysis, inspired by Dremel
  • Google BigQuery: framework for interactive analysis, implementation of Dremel
  • Cloudera Impala: framework for interactive analysis, inspired by Dremel
  • Apache Spark: framework for in-memory cluster computing
  • Apache Spark Streaming: framework for stream processing, part of Spark
  • Apache S4: framework for stream processing, implementation of S4
  • Apache Samza: stream processing framework, based on Kafka and YARN
  • Druid: framework for real-time analysis of large datasets
  • Tachyon: reliable file sharing at memory speed across cluster frameworks
  • Apache Storm: framework for stream processing by Twitter, also available on YARN
  • LinkedIn Databus: stream of change capture events for a database
  • BlinkDB: massively parallel, approximate query engine
  • Splunk: analyzer for machine-generated data
  • Kiji Project: framework to collect and analyze data in real-time, based on HBase
  • Amazon Kinesis: real-time processing of streaming data at massive scale
  • Facebook Peregrine: Map Reduce framework
  • Summingbird: Streaming MapReduce with Scalding and Storm, by Twitter
Large Scale graph processing
  • Apache Spark Bagel: implementation of Pregel, part of Spark
  • GraphX: A Resilient Distributed Graph System on Spark
  • Apache Giraph: implementation of Pregel, based on Hadoop
  • Phoebus: framework for large scale graph processing
Machine learning
  • Apache Mahout: machine learning library for Hadoop
  • Cascading Pattern: machine learning library for Cascading
  • PredictionIO: machine learning server built on Hadoop, Mahout and Cascading
  • MLbase: distributed machine learning libraries for the BDAS stack
  • Vowpal Wabbit: learning system sponsored by Microsoft and Yahoo!
  • H2O: statistical, machine learning and math runtime for Hadoop
Distributed column-oriented data store 
  • Apache HBase: column-oriented distributed datastore, inspired by BigTable
  • Apache Cassandra: column-oriented distributed datastore, inspired by BigTable
  • HyperTable: column-oriented distributed datastore, inspired by BigTable
  • Google BigTable: column-oriented distributed datastore
  • Parquet: columnar storage format for Hadoop
Distributed key/value store
Distributed document-oriented data store 
NewSQL platforms
Distributed graph database 
  • Twitter FlockDB: distributed graph database
  • Titan: distributed graph database, built over Cassandra
Data collection systems
Memcached compatible caching systems
MySQL forks and evolutions
Distributed queuing systems
  • Apache Kafka: distributed publish-subscribe messaging system
  • Kestrel: distributed message queue system
Search engine and framework
Not yet public projects
  • Facebook Scuba: distributed in-memory datastore
  • Facebook Corona: Hadoop enhancement which removes single point of failure
  • Facebook Prism: multi datacenters replication system
  • Facebook Unicorn: social graph search platform
  • Google Megastore: scalable, highly available storage
  • Google MillWheel: fault tolerant stream processing framework
  • Google F1: distributed SQL database
  • Google Spanner: globally distributed database
  • Google GFS: distributed filesystem
  • Google Colossus: distributed filesystem (GFS2)
  • Google MapReduce: map reduce framework
  • Google Pregel: graph processing framework
Interesting papers 2001 – 2010
Interesting papers 2011 – 2012
Interesting papers 2013 – present

source : http://blog.andreamostosi.name/big-data/
