What is a Cloud Data Lake?

Team Zaloni, October 3rd, 2018

Many organizations that we talk to are interested in using cloud infrastructure for their data lake. They’re smart to consider it. It’s a highly flexible deployment model in which you pay only for the compute and storage you use. For companies whose processing needs vary widely, this model can offer a significantly lower price point and shift hardware management to a third party.

What is a cloud data lake?

Contrary to what some organizations have been led to believe, a cloud-based data lake is not an S3 bucket where data is dumped. A data lake is a maintainable, functioning infrastructure that enforces governance across all of the data. It provides access to the correct people at the appropriate stages of the data lifecycle and can adhere to a zone-based architecture specific to an organization’s needs. A data lake should also provide self-service access to end users, reducing overhead on IT.
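To make the idea of zone-based access concrete, here is a minimal sketch in Python. The zone names (raw, trusted, refined) and the roles are hypothetical, chosen only for illustration; a real data lake would define its own zones and would enforce access through its governance tooling rather than a dictionary lookup.

```python
# Hypothetical zones and roles, for illustration only.
# Each zone lists the roles allowed to read data stored in it.
ZONE_READERS = {
    "raw": {"data_engineer"},
    "trusted": {"data_engineer", "data_steward"},
    "refined": {"data_engineer", "data_steward", "analyst"},
}


def can_read(role: str, zone: str) -> bool:
    """Return True if the given role may read data in the given zone."""
    # Unknown zones grant no access by default.
    return role in ZONE_READERS.get(zone, set())


if __name__ == "__main__":
    for zone in ZONE_READERS:
        print(zone, "analyst can read:", can_read("analyst", zone))
```

In this sketch an analyst gets self-service access to the refined zone only, while earlier stages of the data lifecycle stay restricted to the teams that prepare the data.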

Benefits of running in the cloud

Cloud providers have developed a plethora of services and tools that can be used by organizations in multiple ways. This means cloud subscribers have lots of pieces they can build their infrastructure upon. The cost to try (and potentially fail with) a number of options that could work is minimal.

An organization can build on the tools the cloud providers offer to develop a fully functioning data lake. It can start small and scale out if necessary. In short, the cloud provides a scalable architecture with low upfront cost. As the needs of the organization increase, it can scale its compute, storage, and application requests.

In a cloud-based infrastructure, an organization pays only for what it uses. For example, an organization that has high compute needs, but only for short bursts, is an ideal candidate for savings.
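The arithmetic behind that claim can be sketched in a few lines of Python. All figures here (100 nodes, 2 hours of processing per day, a 30-day month, $0.50 per node-hour) are made-up assumptions, and the comparison simplifies by applying the same hourly rate to always-on capacity; real pricing differs by provider and by hardware cost model.

```python
def compute_cost(nodes: int, hours: float, rate_per_node_hour: float) -> float:
    """Cost of running `nodes` machines for `hours` at a flat hourly rate."""
    return nodes * hours * rate_per_node_hour


# Hypothetical workload: 100 nodes, needed 2 hours a day, over a 30-day month.
NODES = 100
RATE = 0.50          # assumed price per node-hour, in dollars
DAYS = 30

burst_cost = compute_cost(NODES, hours=2 * DAYS, rate_per_node_hour=RATE)
always_on_cost = compute_cost(NODES, hours=24 * DAYS, rate_per_node_hour=RATE)
savings_fraction = 1 - burst_cost / always_on_cost

if __name__ == "__main__":
    print(f"Pay-per-use: ${burst_cost:,.2f}")
    print(f"Always-on:   ${always_on_cost:,.2f}")
    print(f"Savings:     {savings_fraction:.0%}")
```

Under these assumptions the bursty workload pays for 60 hours of capacity instead of 720, which is where the savings come from. The shorter and less frequent the bursts, the larger the gap.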

Downside of the Cloud

Although cloud vendors take on a lot of the risk associated with data storage and security, those risks are still very real, and the choice of cloud vendor needs to take them into account. Data access pipelines need to be accounted for as well: can the customer send and receive data at the speed necessary?

An oft-forgotten issue is the risk associated with choosing only one cloud vendor. If a cloud vendor suddenly decides to increase its prices by 20%, this can wreak havoc on an organization’s IT budget. Many organizations are now recognizing vendor risk and are seeking solutions that provide multi-cloud support to mitigate it.

An Ideal Option

Zaloni’s DataOps platform, Arena, orchestrates the ingestion, transformation, tokenization and masking of sensitive/PII data, and provisioning to databases. Zaloni’s data lake management system provides an abstraction layer that leverages the native compute and storage of the underlying infrastructure. Thanks to its flexible architecture, it can natively work with multiple cloud (or on-premises) infrastructures.
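As a general illustration of what an abstraction layer over multiple storage backends can look like, here is a minimal Python sketch. It is not Zaloni’s API or implementation: the `ObjectStore` interface, the `InMemoryStore` class, and the `copy_dataset` helper are all hypothetical names, and the in-memory stores merely stand in for real cloud or on-premises storage.

```python
from abc import ABC, abstractmethod


class ObjectStore(ABC):
    """Hypothetical common interface that every storage backend implements."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...


class InMemoryStore(ObjectStore):
    """Stand-in backend; a real one would wrap a provider's storage service."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._objects: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]


def copy_dataset(key: str, source: ObjectStore, target: ObjectStore) -> None:
    """Move data between backends using only the shared interface."""
    target.put(key, source.get(key))


# Two stores standing in for two different infrastructures.
cloud_a = InMemoryStore("cloud-a")
cloud_b = InMemoryStore("cloud-b")
cloud_a.put("sales/2018.csv", b"id,amount\n1,10\n")
copy_dataset("sales/2018.csv", cloud_a, cloud_b)
```

Because `copy_dataset` depends only on the shared interface, the same pipeline code works regardless of which backend sits underneath. That is the property that makes a multi-cloud layer useful for reducing vendor dependence.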

As of this writing, it is the only data lake management solution that can provide a layer on top of a multi-cloud environment. This is a key win for companies that want to minimize vendor risk.

Learn more about cloud data lakes and the architecture needed to realize them.



about the author

This team of authors from Team Zaloni provides expertise, best practices, tips and tricks, and use cases across varied topics including data governance, data catalog, DataOps, observability, and much more.