November 15th, 2018
If you haven’t already done so, you may be considering building or expanding your data lake on a public cloud, a private cloud, or both. As the volume and variety of big data continue to grow, the cloud makes financial sense and provides much-desired on-demand scalability for processing and storage. Because there is no one-size-fits-all approach, organizations are modernizing their data platforms using multiple deployment models, such as on-premises, hybrid, cloud, and multi-cloud.
Yet, integrating the cloud into your data lake ecosystem can be complicated. One key challenge is managing and governing the data that spans on-premises and cloud-based computing across the enterprise. Although capturing unstructured and semi-structured data in public cloud platforms such as Amazon Web Services (AWS) or Microsoft Azure is relatively straightforward, these providers do not sufficiently capture metadata. Metadata is essential for managing, migrating, accessing and deploying big data – and leveraging many of the benefits associated with the data lake architecture.
The Zaloni Arena platform enables centralized, consistent data management and governance of all of your data across platforms and systems, in the cloud or on-premises. By capturing metadata as data is ingested into the data lake, regardless of platform, Arena enables organizations to confidently use the cloud as a data lake for core use cases that couldn’t be considered before. These include customer 360, fraud analytics, data monetization, and others. Arena is a fully integrated solution that also lets organizations give business users self-service access to the data lake.
One of the cloud’s foremost advantages is rapid elasticity: the ability to provision and pay for just the resources required, in real time. With cloud services like Amazon EMR or Microsoft Azure HDInsight, companies can spin up and scale Hadoop clusters as the business demands. Additionally, decoupled storage and compute enable transient clusters, which automatically and cost-effectively shut down and stop billing when processing is finished. Using Arena, administrators can manage all MapReduce or Spark jobs through an intuitive user interface across all available clusters, regardless of their location or distribution. Administrators can then efficiently run complex, repeatable workflows on complex data sets.
Arena enables near real-time synchronization for certain reference data, a one-time load for historical data, and a full refresh for smaller tables. Teams can configure the list of sources and tables and schedule the refresh frequency for each table. Arena also provides advanced configuration of mapper count, split-by column, primary key selection, and type of refresh (incremental, full, or change data capture). In addition, the solution provides a framework for onboarding other sources in the future without additional development.
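To make the per-table settings above more concrete, here is a minimal sketch of how such a configuration might be modeled. It is not Arena's actual configuration format: the class name TableIngestConfig, its field names, the cron-style schedule, and the validation rules are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class RefreshType(Enum):
    """The three refresh types named in the text."""
    INCREMENTAL = "incremental"
    FULL = "full"
    CDC = "change data capture"


@dataclass(frozen=True)
class TableIngestConfig:
    """Hypothetical per-table ingestion settings (illustrative names only)."""
    source: str                        # logical name of the source system
    table: str                         # table to ingest
    refresh_type: RefreshType
    schedule_cron: str                 # refresh frequency, assumed to be a cron expression
    mapper_count: int = 1              # number of parallel extract tasks
    split_by: Optional[str] = None     # column used to divide rows among mappers
    primary_key: Optional[str] = None  # key used to merge changed rows

    def validate(self) -> None:
        """Check the assumed consistency rules between the settings."""
        if self.mapper_count < 1:
            raise ValueError("mapper_count must be at least 1")
        # Assumption: parallel extraction needs a column to split the work on.
        if self.mapper_count > 1 and self.split_by is None:
            raise ValueError("split_by is required when mapper_count > 1")
        # Assumption: incremental and CDC refreshes merge rows by primary key.
        if self.refresh_type is not RefreshType.FULL and self.primary_key is None:
            raise ValueError("primary_key is required for incremental or CDC refresh")
```

Under these assumptions, a small reference table could use a full refresh with a single mapper, while a large transactional table would declare a split-by column and a primary key so that it can be extracted in parallel and merged incrementally.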
When a transient cluster is shut down, the metadata is automatically deleted by the cloud provider. To gain the greatest value from transient clusters, Arena monitors ingestion of the data that’s being loaded to the cluster and stores the resulting metadata outside the cloud platform so that it’s available even after the cluster is terminated.
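The idea of keeping metadata outside the cluster can be sketched in a few lines. This is only an illustration of the general pattern, not how Arena implements it: the function names, the record fields, and the JSON Lines catalog file (standing in for whatever external metadata store is actually used) are assumptions.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def capture_metadata(data_file: Path, cluster_id: str) -> dict:
    """Build a metadata record for one ingested file (hypothetical fields)."""
    raw = data_file.read_bytes()
    return {
        "file_name": data_file.name,
        "size_bytes": len(raw),
        "sha256": hashlib.sha256(raw).hexdigest(),
        "cluster_id": cluster_id,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }


def persist_metadata(record: dict, catalog_path: Path) -> None:
    """Append the record to a catalog that lives outside the cluster."""
    catalog_path.parent.mkdir(parents=True, exist_ok=True)
    with catalog_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


def load_catalog(catalog_path: Path) -> list:
    """Read every record back, independent of any cluster's lifetime."""
    with catalog_path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Because the catalog is written to a location that does not share the cluster's lifecycle, the records remain readable with load_catalog after the cluster and its local storage are gone.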
No matter how you deploy your data lake (e.g., on-premises, hybrid, cloud, multi-cloud), consistent, enterprise-wide governance is critical to managing key big data challenges, including:
Automation of repeatable management tasks and processes is essential at the scale of big data. Arena allows organizations to operationalize data along the entire pipeline, from data source to data consumer. In addition, to give business end users access, Arena provides a self-service data catalog, which reduces reliance on IT and speeds time to business insight. With Arena, data residing in Amazon S3 or Azure Storage is automatically cataloged, and users can easily provision serving-layer data stores like Amazon Redshift for rapid data discovery and consumption.
The cloud is a powerful, cost-effective platform that is lowering the barrier to entry for big data analytics for companies of all sizes. With the flexibility that multiple deployment models provide for data platforms, most organizations today can derive significant value from a data lake in the cloud while still integrating with existing systems and retaining tight control over sensitive data.