“Big data” is one of the most inflated buzzwords of recent years. Technologies born to handle huge datasets and overcome the limits of previous products are gaining popularity outside the research environment. The following list is meant as a reference to this world. It’s still incomplete and always will be.
Batch processing framework and related components
- Apache Hadoop: framework for distributed processing. Integrates MapReduce (parallel processing), YARN (job scheduling) and HDFS (distributed file system)
- Cascading: framework for data management/analytics on Hadoop
- Apache Ambari: operational framework for Hadoop management
- Apache Tez: application framework for executing a complex DAG (directed acyclic graph) of tasks, built on YARN
- Apache Falcon: data management framework
- Apache Avro: data serialization system
- Apache Flume: service to manage large amounts of log data
- Apache Sqoop: tool to transfer data between Hadoop and a structured datastore
- Apache Thrift: framework to build binary protocols
- Apache Hama: BSP (Bulk Synchronous Parallel) computing framework
- OpenMPI: message passing framework
- Apache Gora: framework for in-memory data model and persistence
- Disco: MapReduce framework developed by Nokia
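To give a rough idea of the MapReduce model that Hadoop, Disco and several other entries in this list build on, here is a minimal single-process sketch in Python. It is not how any of these frameworks is implemented: the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and a real framework would distribute each phase across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, the step a framework performs between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence (hypothetical, in-memory only)."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = ["big data is big", "data about data"]
    print(word_count(docs))
```

The point of the split is that map_phase and reduce_phase only ever see one document or one key at a time, which is what lets a real framework run them in parallel on different nodes.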
Software and libraries for cluster management
- Apache ZooKeeper: centralized service for process management
- Apache Helix: cluster management framework
- Apache Mesos: cluster manager
- Apache Whirr: set of libraries for running cloud services
- LinkedIn Norbert: cluster manager
- LinkedIn White Elephant: log aggregator and dashboard
- LinkedIn Kamikaze: utility package for compressing sorted integer arrays
- Serf: decentralized solution for service discovery and orchestration
- Apache Knox: single point of secure access for Hadoop clusters
- Chronos: distributed and fault-tolerant scheduler
- Hortonworks Hoya: application that can deploy HBase clusters on YARN
Data analysis platform and languages
- Apache Hive: SQL-like data warehouse system for Hadoop
- Apache HCatalog: table and storage management layer for Hadoop
- Apache Pig: high level language to express data analysis programs for Hadoop
- Twitter Scalding: Scala library for MapReduce jobs, built on Cascading
- Cascading Lingual: SQL-like query language for Cascading
- Shark: data warehouse system for Spark
- LinkedIn DataFu: collection of user-defined functions for Hadoop and Pig
- Pivotal HAWQ: SQL-like data warehouse system for Hadoop
- Hunk: Splunk analytics for Hadoop
- Cascalog: data processing and querying library
- Stinger: interactive query for Hive
Schedulers for Hadoop
- Apache Oozie: workflow job scheduler
- LinkedIn Azkaban: batch workflow job scheduler
Fast/Streaming big data processing
- Apache Drill: framework for interactive analysis, inspired by Dremel
- Google BigQuery: framework for interactive analysis, implementation of Dremel
- Cloudera Impala: framework for interactive analysis, inspired by Dremel
- Apache Spark: framework for in-memory cluster computing
- Apache Spark Streaming: framework for stream processing, part of Spark
- Apache S4: framework for stream processing, implementation of S4
- Apache Samza: stream processing framework, based on Kafka and YARN
- Druid: framework for real-time analysis of large datasets
- Tachyon: reliable file sharing at memory speed across cluster frameworks
- Apache Storm: framework for stream processing by Twitter, also available on YARN
- LinkedIn Databus: stream of change capture events for a database
- BlinkDB: massively parallel, approximate query engine
- Splunk: analyzer for machine-generated data
- Kiji Project: framework to collect and analyze data in real-time, based on HBase
- Amazon Kinesis: real-time processing of streaming data at massive scale
- Facebook Peregrine: MapReduce framework
- Summingbird: Streaming MapReduce with Scalding and Storm, by Twitter
Large-scale graph processing
- Apache Spark Bagel: implementation of Pregel, part of Spark
- GraphX: A Resilient Distributed Graph System on Spark
- Apache Giraph: implementation of Pregel, based on Hadoop
- Phoebus: framework for large scale graph processing
Machine learning
- Apache Mahout: machine learning library for Hadoop
- Cascading Pattern: machine learning library for Cascading
- PredictionIO: machine learning server built on Hadoop, Mahout and Cascading
- MLbase: distributed machine learning libraries for the BDAS stack
- Vowpal Wabbit: learning system sponsored by Microsoft and Yahoo!
- H2O: statistical, machine learning and math runtime for Hadoop
Distributed column-oriented data store
- Apache HBase: column-oriented distributed datastore, inspired by BigTable
- Apache Cassandra: column-oriented distributed datastore, inspired by BigTable
- HyperTable: column-oriented distributed datastore, inspired by BigTable
- Google BigTable: column-oriented distributed datastore
- Parquet: columnar storage format for Hadoop
Distributed key/value store
- Apache Accumulo: distributed key/value store, built on Hadoop
- LinkedIn Voldemort: distributed key/value storage system
- Google App Engine Datastore: schemaless object datastore
- Amazon DynamoDB: distributed key/value store, implementation of Dynamo
- Storehaus: library to work with asynchronous key value stores, by Twitter
- ElephantDB: Distributed database specialized in exporting data from Hadoop
Distributed document-oriented data store
- LinkedIn Espresso: horizontally scalable document-oriented NoSQL data store
- Google App Engine Datastore: schemaless object datastore
- Facebook Haystack: object storage system
- jumboDB: document oriented datastore over Hadoop
NewSQL platforms
- Facebook PrestoDB: distributed SQL query engine
- NuoDB: SQL/ACID compliant distributed database
- Amazon RedShift: data warehouse service, based on PostgreSQL
- FoundationDB: distributed database, inspired by F1
- HadoopDB: hybrid of MapReduce and DBMS
- InfiniSQL: infinitely scalable RDBMS
Distributed graph database
- Twitter FlockDB: distributed graph database
- Titan: distributed graph database, built over Cassandra
Data collection systems
- Apache Chukwa: data collection system
- Facebook Scribe: streamed log data aggregator
- Fluentd: tool to collect events and logs
Memcached compatible caching systems
- Facebook McDipper: key/value cache for flash storage
- Facebook Memcached: fork of Memcache
- Twitter Fatcache: key/value cache for flash storage
- Twitter Twemcache: fork of Memcache
MySQL forks and evolutions
- MySQL Cluster: MySQL implementation using NDB Cluster storage engine
- MariaDB: enhanced, drop-in replacement for MySQL
- Google Cloud SQL: MySQL databases in Google’s cloud
- Amazon RDS: MySQL databases in Amazon’s cloud
- Percona Server: enhanced, drop-in replacement for MySQL
- ProxySQL: High Performance Proxy for MySQL
Distributed queuing systems
- Apache Kafka: distributed publish-subscribe messaging system
- Kestrel: distributed message queue system
Search engine and framework
- HBase Coprocessor: implementation of Percolator, part of HBase
- Apache Lucene: Search engine library
- Apache Solr: Search platform for Apache Lucene
- ElasticSearch: Search and analytics engine based on Apache Lucene
- Sphinx: Full-text search engine
Not yet public projects
- Facebook Scuba: distributed in-memory datastore
- Facebook Corona: Hadoop enhancement which removes single point of failure
- Facebook Prism: multi datacenters replication system
- Facebook Unicorn: social graph search platform
- Google Megastore: scalable, highly available storage
- Google MillWheel: fault tolerant stream processing framework
- Google F1: distributed SQL database
- Google Spanner: globally distributed database
- Google GFS: distributed filesystem
- Google Colossus: distributed filesystem (GFS2)
- Google MapReduce: map reduce framework
- Google Pregel: graph processing framework
Interesting papers 2001 – 2010
- 2003 - Google - The Google File System
- 2004 - Google - MapReduce: Simplified Data Processing on Large Clusters
- 2006 - Google - Bigtable: A Distributed Storage System for Structured Data
- 2006 - Google - The Chubby lock service for loosely-coupled distributed systems
- 2007 - Amazon - Dynamo: Amazon’s Highly Available Key-value Store
- 2007 - Paxos Made Live – An Engineering Perspective (describes Chubby)
- 2008 - AMPLab - Chukwa: A large-scale monitoring system
- 2009 - HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads
- 2010 - Yahoo - S4: Distributed Stream Computing Platform
- 2010 - Google - Dremel: Interactive Analysis of Web-Scale Datasets
- 2010 - Google - Large-scale Incremental Processing Using Distributed Transactions and Notifications (base of Percolator and Caffeine)
- 2010 - Google - Pregel: A System for Large-Scale Graph Processing
- 2010 - Google - Storage Architecture and Challenges
- 2010 - AMPLab - Spark: Cluster Computing with Working Sets
- 2010 - Facebook - Finding a needle in Haystack: Facebook’s photo storage
Interesting papers 2011 – 2012
- 2011 - Google - Megastore: Providing Scalable, Highly Available Storage for Interactive Services
- 2011 - AMPLab - Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
- 2011 - AMPLab - Scarlett: Coping with Skewed Popularity Content in MapReduce Clusters
- 2012 - Google - Spanner: Google’s Globally-Distributed Database (also describes Colossus)
- 2012 - Google - Processing a trillion cells per mouse click (base of PowerDrill)
- 2013 - AMPLab - BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data
- 2012 - Microsoft - Paxos Made Parallel (base of Tribble)
- 2012 - Microsoft - Paxos Replicated State Machines as the Basis of a High-Performance Data Store
- 2012 - AMPLab - Shark: Fast Data Analysis Using Coarse-grained Distributed Memory
- 2012 - AMPLab - Fast and Interactive Analytics over Hadoop Data with Spark
- 2012 - AMPLab - Blink and It’s Done: Interactive Queries on Very Large Data
Interesting papers 2013 – present
- 2013 - Facebook - Scaling Memcache at Facebook
- 2013 - Facebook - Unicorn: A System for Searching the Social Graph
- 2013 - Facebook - Scuba: Diving into Data at Facebook
- 2013 - Google - MillWheel: Fault-Tolerant Stream Processing at Internet Scale
- 2013 - Google - F1: A Distributed SQL Database That Scales
- 2013 - Google - Online, Asynchronous Schema Change in F1
- 2013 - Metamarkets - Druid: A Real-time Analytical Data Store
- 2013 - Microsoft - Scalable Progressive Analytics on Big Data in the Cloud
- 2013 - Google - HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm
- 2013 - AMPLab - GraphX: A Resilient Distributed Graph System on Spark
- 2013 - AMPLab - Shark: SQL and Rich Analytics at Scale
- 2013 - AMPLab - MLbase: A Distributed Machine-learning System
- 2013 - AMPLab - Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices
Source: http://blog.andreamostosi.name/big-data/