Enterprise Systems for Analytics

This post is an addendum to 'Big Data Before The Internet'. It talks about the Data Warehouse architecture at a very high level, which is important for understanding Big Data systems. 

The architecture of Analytics in Enterprise Systems has three major parts: 

  • Operational Systems 
  • Data Warehouse 
  • Business Intelligence 

Operational Systems

Operational systems use RDBMS to store operational data, that is current in nature and is constantly read and modified. We can understand operational data through a simple Retail example. If a customer places an order on an eCommerce website, this order is an operational data in the Retailer’s system. The retailer modifies the order during its entire fulfillment process by adding various details ranging from inventory details, shipment details, delivery details etc. The customer also can modify this order by changes like cancellation. In summary, the order is constantly modified by the Retailer or by the Consumer till it is delivered. Once the order reaches a stage when it cannot be returned back, the order is no longer remains operational data and becomes historical data which is useful for analytics and reporting. A typical retailer has many operational systems like ERP, CRM, Warehouse Management, Logistics, Order Management, Point of Sales etc.

As Operational data is both read and modified frequently, databases used for handling such data are optimised for a large number of concurrent read and writes called transactions. These reads (queries) and writes (inserts, updates, and deletes) in a transaction involve a small number of records related to only that transaction. 

There are two very significant limitations of Operational Systems: 
  • Operational systems are not optimised for large read loads that are generated by analytics. On the other hand, the systems for analytics use indexing very generously, which transactional systems cannot afford as it severely slows down the writes. Once transactional data becomes historical data, it is moved out of operational systems into a Data Warehouse system for analytics. 
  • All Operational systems within an enterprise work with overlapping data that is logically the same but is not consistently represented across different systems. Data models differ from system to system. The data may have different IDs for the same entities. Same entities may have different table names, column names, data types, even different values etc. This makes a holistic analysis of this data extremely cumbersome.  

Data Warehouse

Data Warehouses are also RDBMS based systems and they address the above two limitations of Operational systems. They are optimised for large reads and are a centralised store of integrated data. They are the single point of reference for the data in an enterprise. The data inside a Data Warehouse is clean, integrated and granular. It conforms to integrity constraints and is not a subject to change. So, the data inside a Data Warehouse is used for reliable analytics and reporting.

The process that is responsible for pulling data from Operational systems into a Data Warehouse is called ETL (Extract Transform Load). It pulls data into a staging area that resembles the structure in Operational systems. The data is then cleaned, transformed and pushed into a repository called ODS (Operational Data Store). The data inside an ODS is clean and integrated, but still subject to modification.
For ETL process to transform and move data from Operational System to Data warehouse format, ETL processes need information about the data model of both the systems and mapping logic. This information is called Metadata. 

Once a set of data in an ODS reaches a state where its no longer modified, it is moved into a Data Warehouse. The data in Data Warehouse can be further subdivided into independent subject areas like HR, Finance, Sales, Inventory etc. For easy viewing of this data, the data related to each subject area is physically separated into subsets called Data Marts. The data in Data Marts is summarised and denormalized for 'read' query efficiency.

The above architecture of Data Warehouse is called Inmon architecture after its creator Bill Inmon. There is another architecture of data warehousing called Kimball architecture whose creator was Ralph Kimball. These two architectures differ in their implementation approach, but from Big Data point of view, these differences are insignificant. For large enterprises, Inmon architecture is more robust. 

Business Intelligence

BI transform raw data in a Data Warehouse into useful information for decision making. The primary purpose is to get reports that management can look into and make informed decisions. Majority of other users get daily reports on key performance metrics. Analysts can use it for analyzing data through ad hoc queries, data mining etc. Reports can be simple spreadsheets or graphical visualization. They can be personalized or statistical in nature.


Post a Comment

Popular posts from this blog

Apache Hadoop Ecosystem

Big Data After The Internet