Showing posts from October, 2017

Apache Hadoop Ecosystem

Hadoop HDFS - 2007 - A distributed file system for reliably storing huge amounts of unstructured, semi-structured and structured data in the form of files.

Hadoop MapReduce - 2007 - A distributed algorithm framework for the parallel processing of large datasets on the HDFS filesystem. It runs on a Hadoop cluster but also supports other data stores like Cassandra and HBase.

Cassandra - 2008 - A key-value pair NoSQL database, with column-family data representation and asynchronous masterless replication.

HBase - 2008 - A key-value pair NoSQL database, with column-family data representation and master-slave replication. It uses HDFS as its underlying storage.

Zookeeper - 2008 - A distributed coordination service for distributed applications. It is based on a Paxos algorithm variant called Zab.

Pig - 2009 - A scripting interface over MapReduce for developers who prefer a scripting interface over native Java MapReduce programming.

Hive - 2009 - A SQL interface…
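To make the MapReduce programming model above concrete, here is a toy, single-process word-count sketch in Python. It only illustrates the map, shuffle, and reduce phases; a real Hadoop job runs these phases distributed across a cluster, with the framework handling the shuffle/sort between mappers and reducers.

```python
from collections import defaultdict

def map_phase(lines):
    """Mapper: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reduce_phase(grouped):
    """Reducer: sum the counts emitted for each word."""
    return {word: sum(counts) for word, counts in grouped}

lines = ["big data big systems", "data systems data"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'big': 2, 'data': 3, 'systems': 2}
```

The same mapper and reducer logic, written against the Hadoop Java API, would scale across machines because each phase is stateless per key.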

Big Data After The Internet

Until 1995, most people did not know about the Internet. It was hard to use until the Netscape browser arrived and its famous IPO happened. The arrival of Netscape meant anyone could create material and anyone with a connection could view it. The Internet's popularity resulted in the mushrooming of websites like AOL, MSN, Yahoo, CNN, Napster and many more. They provided free information-sharing services like email, chat, photo sharing, video sharing, blogging, news, weather, music, games, etc. These sites were generating, collecting and sharing an enormous amount of data for people all over the globe. There were, of course, new-generation e-commerce companies like Amazon and eBay that also contributed to the overall information available, but sharing information was not at the core of their strategy. Why is this phenomenon of information sharing noteworthy? There are two good reasons: The data on the Internet was freely available to everyone on the Internet…

Big Data Before The Internet

The term ‘Big Data’ was used for the first time in a scientific journal published by NASA, back in 1997: “Visualisation provides an interesting challenge for computer systems: data sets are generally quite large, taxing the capacities of main memory, local disk, and even remote disk. We call this the problem of big data. When data sets do not fit in main memory (in core), or when they do not fit even on local disk, the most common solution is to acquire more resources.” [1][2]. This was ‘Big Data’ in 1997, which is different from the ‘Big Data’ of current times, but the fundamental problem of our ability to scale remains the same. Conceptually, ‘Big Data’ is data that is beyond the storage and processing power of current systems. It is a moving target. The purpose of this post is to go over this particular aspect of data engineering called Big Data, focusing on big architectural improvements in data management systems. The genesis of the Internet dates back as early as…

Enterprise Systems for Analytics

This post is an addendum to 'Big Data Before The Internet'. It talks about the Data Warehouse architecture at a very high level, which is important for understanding Big Data systems.

The architecture of Analytics in Enterprise Systems has three major parts:

Operational Systems
Data Warehouse
Business Intelligence

Operational Systems
Operational systems use an RDBMS to store operational data, which is current in nature and is constantly read and modified. We can understand operational data through a simple retail example. If a customer places an order on an eCommerce website, that order is operational data in the retailer's system. The retailer modifies the order throughout the fulfillment process, adding details such as inventory, shipment, and delivery details. The customer can also modify the order, for example by cancelling it. In summary, the order is constantly modified by the Retailer or by the Consumer…
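The read-and-modify lifecycle of operational data can be sketched with a small, hypothetical `orders` table. This is a minimal illustration using SQLite (the table name, columns, and status values are invented for the example), not a description of any particular retailer's schema:

```python
import sqlite3

# A hypothetical "orders" table: operational data that is current
# in nature and constantly read and modified during fulfillment.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        status      TEXT NOT NULL,
        shipment_id TEXT
    )
""")

# Customer places an order (new operational record).
conn.execute("INSERT INTO orders (order_id, status) VALUES (?, ?)",
             (101, "PLACED"))

# Retailer modifies it during fulfillment, adding shipment details.
conn.execute("UPDATE orders SET status = ?, shipment_id = ? WHERE order_id = ?",
             ("SHIPPED", "SHP-9", 101))

# Customer can also modify it, e.g. a cancellation.
conn.execute("UPDATE orders SET status = ? WHERE order_id = ?",
             ("CANCELLED", 101))

status = conn.execute("SELECT status FROM orders WHERE order_id = ?",
                      (101,)).fetchone()[0]
print(status)  # CANCELLED
```

The key property, as the post notes, is that the same row is updated in place again and again; a Data Warehouse, by contrast, keeps historical, largely read-only copies of such data for analysis.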