Apache Hadoop Ecosystem

  • Hadoop HDFS - 2007 - A distributed file system for reliably storing huge amounts of unstructured, semi-structured and structured data in the form of files. 
  • Hadoop MapReduce - 2007 - A distributed framework for processing large datasets in parallel, typically on data stored in HDFS. It runs on a Hadoop cluster and can also read from and write to other data stores such as Cassandra and HBase.
  • Cassandra - 2008 - A key-value NoSQL database with a column-family data model and asynchronous, masterless replication.
  • HBase - 2008 - A key-value NoSQL database with a column-family data model and a master-slave architecture. It uses HDFS as its underlying storage.
  • ZooKeeper - 2008 - A coordination service for distributed applications. It relies on Zab, an atomic broadcast protocol similar to Paxos.
  • Pig - 2009 - A scripting interface (Pig Latin) over MapReduce for developers who prefer scripts to native Java MapReduce programming.
  • Hive - 2009 - A SQL interface over MapReduce for developers and analysts who prefer SQL to native Java MapReduce programming.
  • Mahout - 2009 - A library of machine learning algorithms, implemented on top of MapReduce, for finding meaningful patterns in HDFS datasets. 
  • Sqoop - 2010 - A tool to import data from relational databases and data warehouses into HDFS or HBase, and to export it back.
  • YARN - 2011 - A system to schedule applications and services on a Hadoop cluster and manage cluster resources such as memory and CPU.
  • Flume - 2011 - A tool to collect, aggregate, reliably move and ingest large amounts of data into HDFS. 
  • Storm - 2011 - A system to process high-velocity streaming data with 'at least once' message semantics. 
  • Spark - 2012 - An in-memory data processing engine that can run a DAG of operations. It provides libraries for machine learning, SQL queries and near real-time stream processing.
  • Kafka - 2012 - A distributed messaging system with partitioned topics for very high scalability. 
  • SolrCloud - 2012 - A distributed search engine with a REST-like interface for full-text search. It uses the Lucene library for indexing.
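
Since several of the tools above (Pig, Hive, Mahout) are layers over MapReduce, it may help to see the model itself. The following is a minimal single-process sketch in plain Python, not Hadoop code: the function names map_phase, shuffle, reduce_phase and word_count are made up for illustration, and a real job would run the map and reduce steps on many machines with the framework handling the shuffle.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key.

    In a real cluster the framework does this between map and reduce;
    here it is simulated with an in-memory dictionary.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce step: collapse the list of counts for one word into a total."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big cluster"]))
```

The point of splitting the work this way is that each map call and each reduce call depends only on its own input, which is what lets a framework spread them across a cluster.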
