February 9th, 2016
When it comes to big data, more and more enterprises are embracing the cloud for its flexibility. Many of these companies are adopting a hybrid approach to their big data lakes, looking for ways to leverage the efficiencies and opportunities of cloud-based applications and storage alongside their on-premises data.
A data management platform such as Zaloni’s Arena can span on-premises and cloud-based computing across the enterprise. By capturing the metadata needed to implement consistent data management and governance processes, companies are able to confidently use the cloud as a Hadoop data lake for core use cases that couldn’t be considered before.
As you know, one of the cloud’s advantages is its rapid elasticity—the ability to provision and pay for just the resources required in real time, on the fly. Also, storage and compute services can be decoupled in the cloud. Both of these capabilities have significant implications in terms of lowering the barrier to entry for companies of all sizes to derive value from big data.
Let’s take a look at how these capabilities, combined with a data management platform designed for Hadoop, make the cloud a good option for your data lake and future growth.
With cloud services like Amazon Web Services (AWS) EMR or Microsoft Azure HDInsight, companies can spin up and scale Hadoop clusters as business demands. However, maintaining persistent clusters can be expensive, particularly for proof-of-concept projects and sandbox environments that may not produce a return on investment. Decoupled storage and compute make transient clusters—which automatically shut down and stop billing when processing is finished—a more cost-effective option. This allows administrators to run complex, repeatable workflows on the most comprehensive data sets in the most economical manner.
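As a minimal sketch of what a transient cluster looks like in practice, the snippet below builds the request body for boto3’s EMR `run_job_flow` call. Setting `KeepJobFlowAliveWhenNoSteps` to `False` is what makes the cluster transient: it terminates (and stops billing) once its steps complete. The bucket names, script path, and step name are hypothetical placeholders.

```python
# Sketch: a transient EMR cluster request. The cluster runs its steps and
# then terminates automatically, so billing stops when processing ends.
# Bucket names and script URIs below are hypothetical placeholders.

def transient_cluster_request(name, log_uri, script_uri):
    """Build the keyword arguments for boto3's emr.run_job_flow()."""
    return {
        "Name": name,
        "LogUri": log_uri,
        "ReleaseLabel": "emr-5.0.0",
        "Instances": {
            "MasterInstanceType": "m4.large",
            "SlaveInstanceType": "m4.large",
            "InstanceCount": 3,
            # False => shut the cluster down once all steps finish.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": "nightly-etl",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", script_uri],
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

# With AWS credentials configured, the request would be submitted via:
#   import boto3
#   boto3.client("emr").run_job_flow(**transient_cluster_request(
#       "etl-run", "s3://my-logs/emr/", "s3://my-code/etl.py"))
```

Because the step’s `ActionOnFailure` is also set to terminate, a failed job does not leave an idle cluster running up charges.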
Today, processing requirements can be variable. Customers no longer need to duplicate data for the sake of accessing compute. By using a data management platform that maintains metadata, customers can scale up processing without having to scale up or duplicate storage. In addition to needing less storage, when storage and compute are separate, customers can pay for storage at a lower rate, regardless of computing needs. Cloud service providers like AWS even offer a range of storage options at different price points, depending on accessibility requirements.
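To make the tiered-pricing point concrete, the sketch below builds an S3 lifecycle configuration (the body passed to boto3’s `put_bucket_lifecycle_configuration`) that transitions aging data to progressively cheaper storage classes. The prefix and day thresholds are hypothetical; the right values depend on your accessibility requirements.

```python
# Sketch: an S3 lifecycle rule that moves data to cheaper storage tiers as
# it ages -- Standard-IA after 30 days, Glacier after a year. The prefix
# and thresholds here are illustrative placeholders.

def tiered_lifecycle(prefix, ia_days=30, glacier_days=365):
    """Build the body for boto3's s3.put_bucket_lifecycle_configuration()."""
    return {
        "Rules": [{
            "ID": "tier-aging-data",
            "Prefix": prefix,  # apply only to this part of the lake
            "Status": "Enabled",
            "Transitions": [
                {"Days": ia_days, "StorageClass": "STANDARD_IA"},
                {"Days": glacier_days, "StorageClass": "GLACIER"},
            ],
        }]
    }
```

Because storage is billed independently of compute, a rule like this lowers the carrying cost of raw data without affecting how often clusters are spun up against it.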
When a transient cluster is shut down, any metadata stored on the cluster itself is deleted along with it. To gain the greatest value from transient clusters, use a data management platform to monitor ingestion of the data that’s being loaded to the cluster and store the resulting metadata outside EMR/HDInsight so that it’s available after the cluster is terminated.
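One way to keep table metadata outside the cluster on EMR is to point Hive’s metastore at an external database through a `hive-site` configuration classification. The sketch below builds such a `Configurations` entry for `run_job_flow`; the host, database name, and credentials are placeholders, and in practice the password would come from a secrets store rather than code.

```python
# Sketch: an EMR configuration that points Hive's metastore at an external
# MySQL database (e.g. on Amazon RDS), so table metadata survives cluster
# termination. Host, database, and credentials are placeholders.

def external_metastore_config(host, db, user, password):
    """Build the Configurations list passed to emr.run_job_flow()."""
    jdbc_url = ("jdbc:mysql://%s:3306/%s?createDatabaseIfNotExist=true"
                % (host, db))
    return [{
        "Classification": "hive-site",
        "Properties": {
            "javax.jdo.option.ConnectionURL": jdbc_url,
            "javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
            "javax.jdo.option.ConnectionUserName": user,
            "javax.jdo.option.ConnectionPassword": password,
        },
    }]
```

With this in place, every transient cluster that launches with the same configuration sees the same tables, even though the clusters themselves come and go.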
Metadata is what allows business users to confidently access and use data. With a data management platform, data residing in S3 or Azure Storage is automatically cataloged and users can easily provision serving layer data stores like Amazon Redshift for rapid data discovery and consumption. Also, as the number of business users increases, metadata enables companies to execute enterprise-wide data governance strategies for the management and use of data.
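As a toy illustration of cataloging, the function below derives minimal per-dataset records from an S3 object listing (the `Key`/`Size` dicts returned by `list_objects_v2`). The `dataset/partition/file` layout convention and the record fields are hypothetical; a real platform captures far richer operational and business metadata.

```python
# Sketch: deriving minimal catalog entries from an S3 object listing so that
# datasets stay discoverable independently of any cluster. The key layout
# (dataset/partition/file) assumed here is a hypothetical convention.

def catalog_entries(objects):
    """Summarize S3 listing dicts (Key, Size) into per-dataset records."""
    catalog = {}
    for obj in objects:
        parts = obj["Key"].split("/")
        if len(parts) < 2:
            continue  # skip objects outside the dataset/partition layout
        dataset = parts[0]
        entry = catalog.setdefault(dataset, {"files": 0, "bytes": 0})
        entry["files"] += 1
        entry["bytes"] += obj["Size"]
    return catalog
```

A catalog built this way is what a serving layer such as Amazon Redshift can be provisioned from, since it records where each dataset lives and how large it is.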
It’s important to note that although capturing unstructured and semi-structured data in AWS or Microsoft Azure is relatively straightforward, these cloud service providers do not offer an easy way to also capture metadata. Metadata is essential for managing, migrating, accessing and deploying big data—and leveraging many of the coveted benefits associated with the data lake architecture. A robust data management platform is an essential component of your data lake that enables companies to implement consistent data management and governance processes across the environment.
A few weeks ago, I presented a webinar on this topic with Ben Lorica of O’Reilly Media. You can access the recording here.
For more information about data lake management and governance, and how to get the most from your data lake in the cloud, contact us.