Data Lake Architecture
Building a data lake architecture and selecting a technology stack is a complex undertaking. It requires integrating numerous technologies for data storage, ingestion, processing, operations, governance, security and data analytics, as well as the specialized expertise to make it all work. When selecting your tech stack, choose technologies that are scalable, extensible, modular and interoperable so that you can incorporate new and emerging tools as they evolve. The big data landscape continues to change rapidly, so this flexibility is critical to making the most of your investment.
Next Generation Data Lakes: On-Prem vs. Cloud
A modern data lake infrastructure should integrate both on-premises and cloud storage. Even today, when cloud adoption seems to be the go-to strategy of every IT expert, on-prem storage and processing remain important to enterprise-wide data lakes because they provide tighter control over data security and data privacy.
However, the cloud is also vital to the data lake. It offers the highly scalable, elastic storage and computing resources enterprises need for large-scale processing and data storage, without the overhead of provisioning and maintaining expensive infrastructure. And because big data tools and technologies continue to change rapidly, cloud-based data lakes can serve as development or test environments for evaluating new tools before bringing them to production, either in the cloud or on-prem.
Data Lake Storage: Your Options
- HDFS: For on-prem data lakes, HDFS remains the storage of choice because it distributes data across the cluster and replicates it. Distribution lets big data workloads run in parallel close to where the data is stored, which speeds up processing, while replication protects against node failures. HDFS also allows enterprises to create storage tiers for data lifecycle management, using those tiers to save on cost while still meeting data retention policies and regulatory requirements.
- Cloud storage: Cloud-based storage offers a unique advantage: it decouples storage from compute, enabling enterprises to cut storage costs and to scale compute resources up or down to meet the demands of specific use cases. Cloud storage also supports tiered storage to optimize cost and to meet data retention and regulatory requirements.
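To make the idea of storage tiers concrete, here is a minimal sketch of a lifecycle rule that assigns a dataset to a tier based on how recently it was accessed. The tier names ("hot", "warm", "cold"), the age thresholds and the seven-year retention period are all assumptions chosen for illustration; they do not reflect the defaults of HDFS or of any cloud provider, and a real policy would be configured in the storage system itself.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical tiers and thresholds (max idle days, tier name).
# Real values depend on the storage system and on company policy.
TIER_RULES = [
    (30, "hot"),    # accessed within the last 30 days
    (365, "warm"),  # accessed within the last year
]
DEFAULT_TIER = "cold"      # everything older goes to the cheapest tier
RETENTION_DAYS = 7 * 365   # assumed retention period, for illustration only


@dataclass
class Dataset:
    name: str
    created: date
    last_accessed: date


def choose_tier(ds: Dataset, today: date) -> str:
    """Pick a storage tier from the number of days since last access."""
    idle_days = (today - ds.last_accessed).days
    for max_idle, tier in TIER_RULES:
        if idle_days <= max_idle:
            return tier
    return DEFAULT_TIER


def is_expired(ds: Dataset, today: date) -> bool:
    """True once the dataset is older than the assumed retention period."""
    return (today - ds.created).days > RETENTION_DAYS
```

A scheduled job could run rules like these over the data lake's inventory and move or delete data accordingly, which is where the cost savings come from.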
Data Lake Processing Technologies
- Hadoop clusters: Hadoop is typically central to an on-prem data lake, as it allows for distributed processing of large datasets across clusters in the enterprise. It can also be deployed in a cloud-based data lake to create a hybrid, enterprise-wide data lake using a single distribution (e.g., Hortonworks, Cloudera or MapR).
- Spark clusters: Apache Spark provides an engine for large-scale data processing that is typically much faster than disk-based Hadoop MapReduce, because it leverages in-memory computing. It can run on Hadoop, on Mesos, in the cloud or in a standalone environment to create a unified compute layer across the enterprise.
- Apache Beam: Apache Beam provides an abstraction on top of the processing cluster. With Beam, enterprises develop their data processing pipelines using a Beam SDK, and then choose a Beam runner to execute the pipeline on a specific large-scale data processing system. Available runners include the Direct Runner, Apex, Flink, Spark, Dataflow and Gearpump (incubating). This design makes the processing pipeline portable across runners, giving the enterprise the flexibility to use the best platform for its data processing requirements in a future-proof way.
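The separation between pipeline definition and runner can be illustrated with a toy example. The sketch below deliberately does not use the Beam SDK: `Pipeline`, `LocalRunner` and `BatchedRunner` are hypothetical classes written only to show the pattern, in which the same pipeline object is handed to interchangeable execution engines and produces the same result.

```python
from typing import Callable, Iterable, List

# In this toy model a transform is an element-wise function on integers.
Transform = Callable[[int], int]


class Pipeline:
    """Holds an ordered list of transforms; knows nothing about execution."""

    def __init__(self) -> None:
        self.transforms: List[Transform] = []

    def apply(self, fn: Transform) -> "Pipeline":
        self.transforms.append(fn)
        return self  # returning self allows chained calls


class LocalRunner:
    """Runs the pipeline in process, one element at a time."""

    def run(self, pipeline: Pipeline, data: Iterable[int]) -> List[int]:
        out = []
        for item in data:
            for fn in pipeline.transforms:
                item = fn(item)
            out.append(item)
        return out


class BatchedRunner:
    """Stand-in for a cluster engine: processes the input in fixed-size chunks."""

    def __init__(self, batch_size: int = 2) -> None:
        self.batch_size = batch_size

    def run(self, pipeline: Pipeline, data: Iterable[int]) -> List[int]:
        items = list(data)
        out: List[int] = []
        for start in range(0, len(items), self.batch_size):
            chunk = items[start:start + self.batch_size]
            out.extend(LocalRunner().run(pipeline, chunk))
        return out


# The pipeline is defined once and is independent of any runner.
pipeline = Pipeline().apply(lambda x: x + 1).apply(lambda x: x * 10)
```

Switching engines means changing only the runner passed at execution time, for example `LocalRunner().run(pipeline, data)` versus `BatchedRunner().run(pipeline, data)`; the pipeline definition itself stays untouched. In Beam, the runner plays the analogous role for real processing systems.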
Knitting It Together: The Data Management Platform
The technology stack needed for a successful data lake is extensive and varied. This raises the question: how can enterprises possibly manage data across such a complex technology stack? Enter the data management platform. A robust data management platform enables enterprises to manage and track data across the various storage, compute and processing layers, as well as throughout the data's lifecycle. Not only does this transparency reduce data preparation time, ease data discovery and speed up business insights, it also helps enterprises meet regulatory requirements around data privacy, security and governance.
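As a rough illustration of what tracking data across layers can mean in practice, the sketch below models a tiny metadata catalog that records where each dataset lives and what it was derived from. The class names, the `location` labels and the `contains_pii` flag are all assumptions made for this example; a real data management platform would capture far richer metadata and collect it automatically.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class DatasetRecord:
    name: str
    location: str                 # illustrative label, e.g. "hdfs" or "cloud"
    contains_pii: bool = False    # hypothetical privacy flag
    derived_from: List[str] = field(default_factory=list)


class Catalog:
    """Minimal in-memory catalog of datasets and their lineage."""

    def __init__(self) -> None:
        self._records: Dict[str, DatasetRecord] = {}

    def register(self, record: DatasetRecord) -> None:
        self._records[record.name] = record

    def lineage(self, name: str) -> Set[str]:
        """Return every upstream dataset that `name` was derived from."""
        seen: Set[str] = set()
        stack = list(self._records[name].derived_from)
        while stack:
            parent = stack.pop()
            if parent in seen:
                continue
            seen.add(parent)
            if parent in self._records:
                stack.extend(self._records[parent].derived_from)
        return seen

    def touches_pii(self, name: str) -> bool:
        """True if the dataset or any of its ancestors is flagged as PII."""
        names = self.lineage(name) | {name}
        return any(
            self._records[n].contains_pii for n in names if n in self._records
        )
```

With lineage recorded this way, a governance question such as "does this report depend on personal data?" becomes a lookup rather than an investigation, even when the source data sits on-prem and the report is built in the cloud.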