Big Data Before The Internet

One of the earliest uses of the term ‘Big Data’ in scientific literature was in a paper by NASA researchers, back in 1997: “Visualisation provides an interesting challenge for computer systems: data sets are generally quite large, taxing the capacities of main memory, local disk, and even remote disk. We call this the problem of big data. When data sets do not fit in main memory (in core), or when they do not fit even on local disk, the most common solution is to acquire more resources.” [1][2]. ‘Big Data’ in 1997 meant something different from ‘Big Data’ today, but the fundamental problem, our ability to scale, remains the same. Conceptually, ‘Big Data’ is data that is beyond the storage and processing power of current systems, which makes it a moving target. The purpose of this post is to go over this particular aspect of data engineering, focusing on major architectural improvements in data management systems.

The genesis of the Internet dates back to the 1960s and 1970s, in the form of the NPL network, ARPANET (the internet of the Cold War) and the TCP/IP network protocol suite [3]. However, the Internet became the Internet as we use it today on 6th Aug 1991, when Tim Berners-Lee at CERN released the World Wide Web for general public use, without much fanfare. The breakthrough was to marry the internet with hypertext (URI, HTML, HTTP) [4]. This led to the rise of consumer-facing websites, which is how most people experience the internet today.

With these two terms 'Big Data' and 'Internet' behind us, we are ready to look into the advances in the field of Big Data and where we are heading. I will divide Big Data advances into three phases.

Phase 1 - Big Data before the Internet: This was the era, before the rise of the Internet, when electronic data was predominantly created by enterprises (private organizations) and by institutions for scientific research and education.

Phase 2 - Big Data after the Internet: The second phase of Big Data began when organizations like Google, Yahoo and Microsoft started collecting data created by consumers, along with data that was freely available on the web. This had profound implications for data engineering for two reasons. First, the scale of the data being collected was unprecedented. Second, the data was largely unstructured.

Phase 3 - Big Data with the Internet Of Things: The third phase of Big Data is what is popularly known as the Internet of Things. This phase is being propelled by advances in digital sensors and by our ability to ingest and process the data they produce. It is not that the world was not using digital sensors before, but much like the story of the Internet, the usage of digital sensors is now crossing an inflection point. Soon, machines feeding data into our systems will be ubiquitous. This phase will not only bring an unprecedented scale of data but also make real-time processing of data far more critical. If we are to believe Seagate & IDC research, ten percent of this data will be hypercritical to our daily lives [6].

I will cover these phases in three separate posts. This post covers the first phase.

Operational Systems

The RDBMS (Relational Database Management System) has been the mainstay of data engineering ever since it came into existence. Though the idea of the RDBMS originated in the early 1970s, and IBM developed some solutions, the first commercially available RDBMS was released by Oracle in 1979. It was followed by DB2, Sybase and others [7]. The RDBMS was an immensely successful system for storing data persistently and querying it efficiently, and it could do so in a highly concurrent multi-user environment. Pretty much all organizations that had invested in IT infrastructure used an RDBMS for executing transactions and performing analytics. Transactional systems came into existence first. These systems used an RDBMS for storing operational data and were called OLTP (Online Transaction Processing) systems. Operational data refers to current data that is created and updated by ongoing business processes. From a technical standpoint, an RDBMS storing this kind of data undergoes simultaneous reads and writes by multiple concurrent business transactions. Each transaction inserts, reads or updates only a very limited number of records, namely those related to that particular transaction.
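
To make the OLTP access pattern concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for an enterprise RDBMS. The accounts table and the open_bank, transfer and balance helpers are hypothetical names invented for this illustration; the point is only that one business transaction touches a couple of rows and either commits or rolls back as a unit.

```python
import sqlite3


def open_bank():
    """Create a tiny in-memory operational database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO accounts (id, balance) VALUES (?, ?)",
        [(1, 100), (2, 50), (3, 75)],
    )
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """One OLTP-style transaction: it reads and updates exactly two rows."""
    # 'with conn' commits if the block succeeds and rolls back on an exception.
    with conn:
        cur = conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ? AND balance >= ?",
            (amount, src, amount),
        )
        if cur.rowcount != 1:
            raise ValueError("insufficient funds or unknown source account")
        cur = conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = ?",
            (amount, dst),
        )
        if cur.rowcount != 1:
            raise ValueError("unknown destination account")


def balance(conn, account_id):
    """Point lookup of a single record, typical of operational reads."""
    row = conn.execute(
        "SELECT balance FROM accounts WHERE id = ?", (account_id,)
    ).fetchone()
    return row[0]
```

Calling transfer(conn, 1, 2, 30) on a fresh database moves 30 units between two rows and leaves every other row untouched. An analytical query, by contrast, would scan and aggregate most of the table, which is the kind of load these systems were not designed for.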

Decision Support Systems

The early 1980s saw the emergence of decision support (analytics) systems, but it was not until 1988 that the term ‘Data Warehouse’ was coined. In that year, Barry Devlin and Paul Murphy published the article "An architecture for a business and information system", in which they introduced the term "business data warehouse":

"The transaction-processing environment in which companies maintain their operational databases was the original target for computerization and is now well understood. On the other hand, access to company information on a large scale by an end user for reporting and data analysis is relatively new" [8].

Any large enterprise had multiple 'operational systems' because it had multiple business functions, locations, software vendors and so on. To do enterprise-level analytics, one had to deal with multiple operational databases built on different technologies, with different data models, in different locations. Analytics also required a summarised enterprise view over several months or years, which created severe read loads that 'operational systems' were not designed for. Thus came Data Warehouses, the foundation for Decision Support Systems. They stored the integrated data of an enterprise for very long durations, and they periodically extracted data from operational systems through ETL (Extract Transform Load) processes. Extracting data from operational systems once per period was far more efficient than having every analytical process go over that data repeatedly. Once data reached a Data Warehouse, it was ready for reporting and analysis, a process called Business Intelligence. The following picture presents a simplified view of a Data Warehouse architecture. Please read my other post for more details on Enterprise Systems.
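
As a rough illustration of the ETL step described above, the sketch below integrates two operational extracts with different layouts into one summarised table. Everything in it is an assumption made for the example: the STORE_SALES and WEB_ORDERS records, their field names, and the run_etl function are hypothetical, and a plain dictionary stands in for the warehouse.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical extracts from two operational systems with different layouts.
STORE_SALES = [
    {"sku": "A1", "sold_on": "2017-01-03", "amount_cents": 1250},
    {"sku": "B2", "sold_on": "2017-01-17", "amount_cents": 800},
]
WEB_ORDERS = [
    {"product": "a1", "date": "2017-01-09", "total": "19.99"},
    {"product": "a1", "date": "2017-02-01", "total": "5.00"},
]


def transform_store(row):
    """Map a store row onto the shared (product, month, cents) layout."""
    return row["sku"].upper(), row["sold_on"][:7], row["amount_cents"]


def transform_web(row):
    """Map a web row onto the same layout, converting currency text to cents."""
    cents = int(Decimal(row["total"]) * 100)
    return row["product"].upper(), row["date"][:7], cents


def run_etl(store_rows, web_rows):
    """Extract both sources, transform them, and load monthly totals."""
    warehouse = defaultdict(int)  # (product, month) -> total sales in cents
    for product, month, cents in map(transform_store, store_rows):
        warehouse[(product, month)] += cents
    for product, month, cents in map(transform_web, web_rows):
        warehouse[(product, month)] += cents
    return dict(warehouse)
```

After one run, reports can read the integrated monthly totals from the result without touching either operational source again, which is the efficiency argument made above.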

With time, database sizes for analytics kept growing. The primary reason was that businesses started storing more data, as they recognized the value of keeping more historical data online for longer to support better analytics and decision making. The secondary reason was tightening regulatory and compliance requirements, which meant keeping more data for longer durations [9].

To run a large database, one option was to buy a large server with multiple processors, large memory and big storage capacity. These are called SMP (Symmetric Multi-Processing) systems. They have multiple processors sharing the same main memory and data storage. Because such systems were based on a single machine, they were limited by the capacity of that machine. Choosing this option as a strategy to handle increasing data loads meant costs that rose steeply with capacity. The improvements gained from superior hardware also reached their limits as data sizes approached 1 terabyte. Data at that scale was enough to degrade database performance considerably.

The other option was to use an RDBMS cluster. Oracle first released a clustering capability in 1985, renamed it Oracle Parallel Server in 1988, and renamed it again as RAC (Real Application Clusters) in 2001 [10]. This clustered architecture is called a ‘shared disk’ architecture: multiple database instances (nodes), each running the database software, share the same database. Clustering provides enterprises with high availability. If one node goes down, there are other nodes to fall back on. Clustering also helped spread the load of multiple users across multiple nodes, and it provided the ability to execute a query in parallel as multiple units of work. This required meticulous design, such as partitioning the data into logically disjoint datasets, to avoid moving data between node caches. Oracle 10g RAC scalability was tested by Quest Software and Dell Computers with 10 nodes. Their research paper projected that Oracle RAC would scale linearly to 16 nodes. It is interesting that the authors of this paper thanked Dell Computers for allocating equipment worth a million dollars for this testing [11]. One can easily guess how expensive these systems are!

Massively Parallel Processing

In the late 1980s and early 1990s, the data warehousing concept gained momentum. It was accompanied by the rise of MPP (Massively Parallel Processing) databases, which soon became the de facto approach for large-scale data warehouses. They used an RDBMS in a shared-nothing architecture, in which each node has its own processors, memory and storage. They were by far the most scalable systems for data warehousing workloads.

Surprisingly enough, their history dates back to 1984, when Teradata released the MPP database 'DBC/1012' as a backend database system for mainframe computers [12, 13]. The name Teradata signifies the capability of storing one trillion bytes of data. However, at this point, the concept of a data warehouse had not yet emerged, and operational systems were still used for analytics. This was coincidentally the time when businesses began to shift from mainframes to more affordable client-server systems based on Unix.

Teradata's architecture was well suited for parallel querying with no shared data. Table rows were split across multiple nodes by assigning each row to a node through hashing, a method that spreads rows in a seemingly random yet deterministic way. All these nodes were linked together using a high-performance interconnect known as Ynet. To protect data against failures, data was also copied to different nodes. Subsequent MPP architectures remained fundamentally the same, but they evolved to solve administrative problems such as tuning and repartitioning the ever-increasing data. Many other products entered this space, such as Netezza, IBM DB2, HP Neoview and Oracle Exadata. Vendors started delivering MPP as a packaged solution of hardware and software combined. These systems were optimised for data warehousing workloads and were easier to deploy, scale and administer. Such packaged systems are known as Data Warehouse Appliances.
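
The hash-based row distribution described above can be sketched roughly as follows. This is a minimal illustration and not Teradata's actual hashing scheme: the SHA-256 based node_for function and the distribute and scatter_gather_sum helpers are hypothetical names, and the "nodes" are just Python lists inside one process.

```python
import hashlib


def node_for(key, num_nodes):
    """Deterministically map a distribution key to a node number."""
    # A cryptographic digest is used only because it is stable across runs;
    # Python's built-in hash() is randomised per process for strings.
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def distribute(rows, key_column, num_nodes):
    """Split rows into per-node partitions that share no data."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[node_for(row[key_column], num_nodes)].append(row)
    return nodes


def scatter_gather_sum(nodes, column):
    """Each node aggregates only its own rows; the partials are then combined."""
    partial_sums = [sum(row[column] for row in node_rows) for node_rows in nodes]
    return sum(partial_sums)
```

Because the mapping is deterministic, any node can work out where a given key lives without consulting shared state, and an aggregate query decomposes into independent per-node work followed by a small combine step. The sketch also shows why repartitioning was an administrative problem: changing num_nodes changes the target node for most keys.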

In January 1992, Walmart went live with a Teradata MPP system to handle 1 TB of data. It then grew to 11 TB in 1996, 24 TB in 1997, 130 TB in 1999, 423 TB in 2004, 1 PB in 2007 and 2.4 PB in 2008 [14]. In 2008, Walmart contracted HP to build a data warehouse for analyzing 4 petabytes of data [15]. This mirrors the massive data growth within enterprises.

Summary

Big Data is not a new problem. It has existed before and will continue to exist for the foreseeable future. However, the primary drivers of data growth vary over time. In this post we looked at significant systems and architectures that were created before the internet era, and at how they evolved over time in response to the Big Data problem. Some important aspects of these systems were:

All these systems dealt with structured data, that is, data that can be represented as rows and columns in a table.
All these systems were licensed products, provided by different vendors. Their cost ran into millions of dollars.
Many of them used proprietary hardware.

References
