March 1st, 2017
Excerpt from ebook, Architecting Data Lakes: Data Management Architectures for Advanced Business Use Cases, by Ben Sharma and Alice LaPlante.
Enterprise data warehouses (EDWs) have been many organizations’ primary mechanism for performing complex business analytics, reporting, and operations. But they are too rigid to work in the era of big data, where large data volumes and broad data variety are the norms. It is challenging to change EDW data models, and field-to-field integration mappings are rigid. EDWs are also expensive.
Perhaps more importantly, most EDWs require that business users rely on IT to do any manipulation or enrichment of data, largely because of the inflexible design, system complexity, and intolerance for human error in EDWs.
Data lakes solve all these challenges, and more. As a result, almost every industry has a potential data lake use case. For example, organizations can use data lakes to get better visibility into data, eliminate data silos, and capture 360-degree views of customers.
With data lakes, organizations can finally unleash big data’s potential across industries.
Because data can be unstructured as well as structured, you can store everything from blog postings to product reviews. And the data doesn’t have to be consistent to be stored in a data lake. For example, you may have the same type of information in very different data formats, depending on who is providing the data. This would be problematic in an EDW; in a data lake, however, you can put all sorts of data into a single repository without worrying about schemas that define the integration points between different data sets.
Today’s data world is a streaming world. Streaming has evolved from rare use cases, such as sensor data from the IoT and stock market data, to very common everyday data, such as social media.
Storing data in an EDW works well for certain kinds of analytics. But when you are using Spark, MapReduce, or other newer processing models, preparing data for analysis in an EDW can take more time than performing the actual analytics. In a data lake, these tools can process data efficiently without excessive prep work. Integrating data involves fewer steps because data lakes don’t enforce a rigid metadata schema. Instead, schema-on-read lets users define the schema they need as part of each query, so structure is applied when the data is read rather than when it is stored.
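To make schema-on-read concrete, here is a minimal sketch in plain Python. It is an illustration only, not how Spark or Hadoop implement the idea: the sample records, field names, and helper functions (`apply_schema`, `query_ratings`) are all hypothetical. Two sources describe the same kind of information (product ratings) in different formats, both are kept exactly as they arrived, and a schema is applied only when a query reads them.

```python
import csv
import io
import json

# Hypothetical raw data, stored untouched in the lake in its original formats.
RAW_JSON_LINES = (
    '{"customer": "Ada", "rating": "5", "text": "Great"}\n'
    '{"customer": "Bo", "rating": "3"}\n'
)
RAW_CSV = "name,stars,comment\nCy,4,Solid\n"


def read_json_lines(raw):
    """Parse newline-delimited JSON into dictionaries, skipping blank lines."""
    return [json.loads(line) for line in raw.splitlines() if line.strip()]


def read_csv(raw):
    """Parse CSV text into dictionaries keyed by the header row."""
    return list(csv.DictReader(io.StringIO(raw)))


def apply_schema(record, schema):
    """Project a raw record onto a query-time schema.

    `schema` maps each output field to a (source field, converter) pair.
    Fields missing from the raw record become None instead of failing.
    """
    result = {}
    for field, (source, convert) in schema.items():
        value = record.get(source)
        result[field] = convert(value) if value is not None else None
    return result


# Schemas defined by the query, not by the storage layer (assumed names).
JSON_SCHEMA = {"customer": ("customer", str), "rating": ("rating", int)}
CSV_SCHEMA = {"customer": ("name", str), "rating": ("stars", int)}


def query_ratings():
    """Read both sources and return rows in one common shape."""
    rows = [apply_schema(r, JSON_SCHEMA) for r in read_json_lines(RAW_JSON_LINES)]
    rows += [apply_schema(r, CSV_SCHEMA) for r in read_csv(RAW_CSV)]
    return rows


if __name__ == "__main__":
    for row in query_ratings():
        print(row)
```

A different query could apply a different schema to the same raw data, for example one that also reads the free-text review fields, without reloading or restructuring anything that is already stored.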
Data lakes also solve the challenges of data integration and accessibility that plague EDWs. Using Hadoop-based big data infrastructures, you can bring together ever-larger data volumes for analytics, or simply store them for some as-yet-undetermined future use. Instead of imposing a single, monolithic, enterprise-wide data model up front, the data lake lets you put off modeling until you actually use the data, which creates opportunities for better operational insights and data discovery. This advantage only grows as data volumes, variety, and metadata richness increase.
Because of economies of scale, some Hadoop users claim they pay less than $1,000 per terabyte for a Hadoop cluster. Although numbers can vary, business users understand that because it’s no longer excessively costly for them to store all their data, they can maintain copies of everything by simply dumping it into Hadoop, to be discovered and analyzed later.
Big data is typically defined as the intersection of volume, variety, and velocity. EDWs are notorious for not being able to scale beyond a certain volume because of architectural restrictions. Data processing takes so long that organizations are prevented from exploiting all their data to its fullest extent. Using Hadoop, petabyte-scale data lakes are both cost-efficient and relatively simple to build and maintain at whatever scale is desired.