Many companies move to the cloud for cost-effectiveness and scalability. Still, the cloud journey can be difficult and costly if companies don’t leverage elastic compute capabilities or don’t have proper data management processes in place. In this blog, I’ll cover how elastic compute can help organizations optimize their cloud data environment and how to manage and govern data in the cloud, along with a couple of use case examples.
Initially, organizations would lift and shift workloads from on-premises deployments to the cloud. These projects were mostly driven by IT departments and focused on infrastructure. By default, the best choice was to start with infrastructure as a service (IAAS).
Although these offloads may be cost-effective compared to on-premises environments, the IAAS approach does not leverage cloud computing’s full benefits. With the increasing demand for data processing, it soon becomes prohibitive to maintain IAAS in the cloud. In most cases, it becomes even more expensive to manage than an in-house deployment.
Most cloud providers encourage organizations to start using their platform as a service (PAAS) offerings and native cloud tools. These approaches provide huge benefits compared to IAAS. PAAS services save costs for workloads that do not require full-time usage of a service.
For example, if an application requires a MySQL database, instead of creating and managing the database on a cloud VM such as an AWS EC2 instance, you can opt to use the RDS service or, in the case of Microsoft Azure, the Azure Database for MySQL service.
The PAAS service changes the billing model from paying for a dedicated machine to paying for usage. Additionally, the PAAS service saves the effort required to patch, manage, and scale the database service, since that is taken care of by the cloud provider.
Today we are faced with an entirely new set of challenges when trying to process data in the cloud. Some of these challenges include:
So let’s look at how to solve these problems by changing the data management mindset. I like to think about the analogy of changing from keeping pet servers to herding cattle. This means that we develop our data pipelines in a transient way: we fire up the data processing infrastructure on demand and destroy it once it is no longer required. Companies like Databricks address this by scaling running clusters up when needed and down to a minimum when not. Some solutions, however, may tie you to services from a specific cloud provider.
Another way to solve this would be to have a single, cloud-agnostic control plane to catalog, control, and consume data from any source to any destination in a governed and managed way, giving organizations the ability to create and destroy the required infrastructure on demand. Zaloni’s Arena is a distributive solution that addresses the challenges of cloud sprawl head-on.
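The create-and-destroy ("cattle") pattern described above can be sketched in a few lines of Python. This is only an illustration: `InMemoryProvisioner` is a hypothetical stand-in for a real cloud API, and the cluster name and node count are made up. The point is that teardown is tied to the lifetime of the job, not left to someone remembering to do it.

```python
from contextlib import contextmanager


class InMemoryProvisioner:
    """Hypothetical stand-in for a cloud API; it only records what is running."""

    def __init__(self):
        self.running = set()

    def create(self, name, nodes):
        self.running.add(name)
        return {"name": name, "nodes": nodes}

    def destroy(self, name):
        self.running.discard(name)


@contextmanager
def transient_cluster(provisioner, name, nodes):
    """Create a cluster for one job and always tear it down afterwards."""
    cluster = provisioner.create(name, nodes)
    try:
        yield cluster
    finally:
        # Runs on success and on failure, so no idle "pet" cluster is left behind.
        provisioner.destroy(name)


def run_pipeline(provisioner, records):
    """Toy pipeline: the 'processing' here is just summing the records."""
    with transient_cluster(provisioner, "nightly-etl", nodes=4) as cluster:
        return {"cluster": cluster["name"], "total": sum(records)}
```

Because the teardown sits in a `finally` block, a failed job releases its infrastructure just like a successful one, which is what keeps transient pipelines from quietly turning back into permanent servers.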
Most big data solutions based on Hadoop collocate compute with storage. Although the distributed file system allows scalability, it is impossible to scale compute separately from storage, a limitation that becomes even more apparent when implementing big data solutions in the cloud. As more and more organizations migrate their analytical workloads to the cloud, they can now use object stores like Amazon S3 and Azure Data Lake Storage (ADLS), which allow compute and storage to scale independently.
Fig. 1 Storage is Collocated within Compute Nodes
Fig. 2 Compute can be scaled separately from storage because the data is not stored within the compute nodes but on an object store service like ADLS or S3.
Now that we have been able to disassociate storage from compute nodes, let us look at a few options by which we can achieve elastic compute for big data workloads. The following are examples of implementing elastic compute in Azure:
Microsoft Azure provides the capability to provision HDInsight clusters attached to Azure Data Lake Storage (ADLS). The HDInsight cluster can be scaled up and down via API calls to Azure services. Since the data is stored outside the cluster, in ADLS, it is possible to terminate the cluster without losing the data and to recreate the compute cluster on demand. We do, however, need to make sure the metadata is also stored in an external SQL database.
Fig. 3 HDInsight with Autoscaling
There are a few drawbacks in this scenario, as the HDInsight cluster is designed to be permanently running. It also takes time to spin a cluster up and down. On the other hand, if there are long-running processes that require the compute cluster to be available around the clock, this scenario will be a better fit.
Fig. 4 Creating an HDInsight Cluster
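The trade-off between a permanently running cluster and one created per job comes down to simple arithmetic. The sketch below is a back-of-the-envelope comparison, not a pricing model: the hourly rate, the job counts, and the assumption that spin-up time is billed at the same rate as processing time are all hypothetical.

```python
def always_on_cost(hourly_rate, hours_in_period):
    """Cost of a cluster that runs for the whole period."""
    return hourly_rate * hours_in_period


def on_demand_cost(hourly_rate, job_hours, jobs, startup_hours):
    """Cost when each job gets its own cluster.

    Assumption: every job pays for its runtime plus the spin-up time.
    """
    return hourly_rate * jobs * (job_hours + startup_hours)


def cheaper_option(hourly_rate, hours_in_period, job_hours, jobs, startup_hours):
    """Return which of the two deployment styles costs less for the period."""
    permanent = always_on_cost(hourly_rate, hours_in_period)
    transient = on_demand_cost(hourly_rate, job_hours, jobs, startup_hours)
    return "on-demand" if transient < permanent else "always-on"
```

With a made-up rate of 10 per hour over a 720-hour month, thirty 2-hour jobs with 15 minutes of spin-up each cost 675 on demand versus 7,200 always-on. If instead a 1-hour job has to start every hour around the clock, the spin-up overhead pushes the on-demand figure to 9,000, and the permanent cluster wins.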
Another approach is to use Databricks instead of running Spark on HDInsight. Azure Databricks does not require a permanently running cluster, and the compute capacity can be instantiated on demand. Azure Databricks can be useful for data science use cases, where experiments may need compute on demand rather than a cluster that is running all the time.
Fig. 5 Databricks with ADLS
Docker and Kubernetes bring yet another dimension to compute services. With Kubernetes, it’s possible to package the processing logic in Docker-based applications, which makes the logic cloud-independent while still benefiting from the elastic capabilities of the Kubernetes architecture.
Fig. 6 Azure Kubernetes Services
Unlike Hadoop (HDInsight) or Spark (Databricks), Kubernetes allows the application developer to choose the language, libraries, and execution environment for each application without having to follow a particular stack. The application environment and the processing code are defined in Dockerfiles, from which a Docker image is built. This image is then deployed to a Kubernetes cluster, creating multiple execution pods. This method allows data to be processed at scale while it still resides within Azure Data Lake Storage.
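To make the "multiple execution pods" idea concrete, here is a minimal sketch of how the code inside the Docker image might decide which part of the data each pod handles. It assumes the pods run as a Kubernetes Indexed Job, which exposes each pod's index in the `JOB_COMPLETION_INDEX` environment variable. `SHARD_COUNT` is a hypothetical variable you would set yourself in the job spec, and listing the files in the data lake is left out.

```python
import os


def shard_for_pod(paths, index, total):
    """Return the slice of input paths that one pod should process."""
    if total <= 0 or not 0 <= index < total:
        raise ValueError("index must be in [0, total)")
    # Round-robin over the sorted paths: every path goes to exactly one pod.
    return [p for i, p in enumerate(sorted(paths)) if i % total == index]


def main(paths, environ=os.environ):
    """Pick this pod's share of the work from its environment."""
    # JOB_COMPLETION_INDEX is set by Kubernetes for Indexed Jobs.
    index = int(environ.get("JOB_COMPLETION_INDEX", "0"))
    # SHARD_COUNT is a hypothetical variable set in the job spec.
    total = int(environ.get("SHARD_COUNT", "1"))
    return shard_for_pod(paths, index, total)
```

Scaling out then means raising the number of pods and `SHARD_COUNT` together; the processing code itself does not change.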
The following are examples of implementing elastic compute in AWS:
The first option is to use the AWS flavor of Hadoop, Amazon EMR. To separate compute from storage, we can use Amazon S3 buckets for data storage and EMR to process the data. EMR can access S3 as additional storage. In this case the metadata can be stored in the RDS service so that it is not lost if the EMR cluster is destroyed and recreated. The EMR cluster can also be configured to autoscale.
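As a rough sketch of what such a transient EMR cluster could look like, the function below builds the keyword arguments for the `run_job_flow` call of the boto3 EMR client without actually calling AWS. The release label, instance types, role names, and step layout are assumptions to adapt to your account, and a real external metastore configuration also needs the JDBC driver and credentials, which are omitted here.

```python
def transient_emr_request(name, log_bucket, script_s3_uri, metastore_jdbc_url):
    """Build keyword arguments for boto3's EMR client `run_job_flow` call."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",  # assumption: use a release you have validated
        "LogUri": f"s3://{log_bucket}/emr-logs/",
        "Applications": [{"Name": "Spark"}, {"Name": "Hive"}],
        "Configurations": [{
            # Point the Hive metastore at an external database (e.g. on RDS)
            # so table metadata survives the cluster being destroyed.
            "Classification": "hive-site",
            "Properties": {"javax.jdo.option.ConnectionURL": metastore_jdbc_url},
        }],
        "Instances": {
            "InstanceGroups": [
                {"Name": "primary", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "core", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            # Terminate the cluster as soon as the steps have finished.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": "process-data",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", script_s3_uri],
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumption: default EMR roles exist
        "ServiceRole": "EMR_DefaultRole",
    }
```

You would pass the result to the client as `emr.run_job_flow(**transient_emr_request(...))`. Since both the data (S3) and the metadata (RDS) live outside the cluster, nothing of value is lost when the cluster terminates itself.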
Another option is to use Athena. Athena is a serverless service that queries S3 data without having to spin up dedicated compute nodes. It scales according to the requirements of the query, and it supports standard SQL. Athena is integrated with the AWS Glue Data Catalog out of the box. Glue stores metadata about the datasets in S3.
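Here is a minimal sketch of what querying through Athena looks like from code. It assumes `client` behaves like a boto3 Athena client (`boto3.client("athena")`) and uses its `start_query_execution` and `get_query_execution` calls. The SQL, database name, and output location are placeholders supplied by the caller, and fetching the result rows is left out.

```python
import time


def run_athena_query(client, sql, database, output_s3_uri,
                     poll_seconds=1.0, sleep=time.sleep):
    """Submit a query and wait until Athena reports a final state."""
    started = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_s3_uri},
    )
    query_id = started["QueryExecutionId"]
    while True:
        status = client.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        # QUEUED and RUNNING are intermediate states; keep polling.
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return query_id, state
        sleep(poll_seconds)
```

Notice that nothing in this code creates, sizes, or destroys a cluster. The only infrastructure decision left to the caller is where in S3 the results should be written.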
So how do you manage your data to support elastic computing and only use the cloud when needed? The data management practice of DataOps brings together concepts from agile software development and DevOps to provide end-to-end visibility and control across your data environments and the supply chain.
Wikipedia defines DataOps as an automated process-oriented methodology, used by analytic and data teams, to improve the quality and reduce the cycle time of data analytics. By applying the concepts of agile development, DevOps, and data management together we can start solving some of the most challenging cloud data problems.
With a DataOps platform, you can connect to your data, catalog the data, run data quality checks, create pipelines, and then version and process them just as you would manage a mobile application’s development. DataOps also applies the concepts of continuous integration and delivery to data management.
Arena by Zaloni is a DataOps platform that includes an active data catalog and standardized governance, and enables self-service data consumption and enrichment. Using Arena’s provisioning capabilities, it is possible to provision not only data but also a compute service. With Arena you can bring the power of infrastructure as code, DevOps, and data management together within the same platform.
Arena integrates with multiple cloud providers and on-premises data systems from a single control plane. The workflows in Arena allow dynamic provisioning of compute nodes in AWS and Azure. It is possible to use an ARM template or a CloudFormation template to orchestrate the on-demand deployment of any of the above scenarios. This gives you a multi-cluster and multi-cloud experience. An end user such as a data analyst or data scientist can provision or lease datasets for their analysis using a self-service marketplace experience.
Fig. 7 Example of a Multi-Cluster Hybrid Deployment of the Arena Platform.
Today we live in a world where we are facing new challenges every day. Let’s use the example of one of the biggest challenges being faced by humanity today, the Coronavirus pandemic. Let’s assume a government organization is tasked to contain and manage the spread of Coronavirus.
A telecommunications operator already has the big data stores in place. Now they need to create a special task force to extensively query the geolocation data for people who came from overseas on a particular flight. This requires processing the data for each of the mobile devices that were active on the network over the last two weeks. To identify the potential spread, this needs to happen very quickly, and it requires extensive processing.
With traditional data processing in an on-premises environment, procuring and setting up new hardware would take months. Even if this were all in the cloud, setting up or extending an existing big data cluster has its own challenges of cost and maintainability. Plus, you would still need to get budget approvals to expand in the cloud.
By leveraging a DataOps platform like Arena, analysts are able to search for and find data easily from the data catalog or marketplace experience, then provision the data to a just-in-time, on-demand elastic compute cluster in a self-service manner. This helps to reduce the time to insight while reducing costs through elastic compute.
In addition to the self-service data catalog and provisioning, Arena is able to add an approval process during data provisioning, in which the analyst submits an approval request with a clear business justification for accessing the data and spinning up the cluster for a specific amount of time.
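To illustrate the idea of an approved, time-boxed request, here is a small Python sketch. It is not Arena’s API: the `LeaseRequest` class and its fields are hypothetical, and it only models the three ingredients mentioned above, namely a justification, an approval, and an expiry.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class LeaseRequest:
    """Hypothetical request to use a dataset and a cluster for a limited time."""
    analyst: str
    dataset: str
    justification: str
    hours: int
    approved_at: Optional[datetime] = None

    def approve(self, now):
        """Approve the request; an empty justification is rejected."""
        if not self.justification.strip():
            raise ValueError("a business justification is required")
        self.approved_at = now

    def is_active(self, now):
        """True only after approval and before the lease period has elapsed."""
        if self.approved_at is None:
            return False
        return now < self.approved_at + timedelta(hours=self.hours)
```

A provisioning workflow could check `is_active` before granting access and again on a schedule, tearing down the cluster once it returns `False`.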
Let’s assume you are an insurance provider and would like to share secured data with third-party research organizations across the globe but you do not want to give them access to your internal systems. You would like to lease the data along with the analytical tools to the third-party organization or subject matter experts for a specified period of time.
With Arena, you can provision the required data along with any compute or processing infrastructure to query the data in an isolated, secured environment. You can get approvals from the dataset owners to share the data, control how it is used, and set a time limit after which access is revoked.
This capability within the Arena platform simplifies and secures external data sharing, enabling use cases such as external data marketplaces or sharing data with third-party organizations.
In the end, combining on-demand elastic compute with a DataOps platform that integrates with multiple cloud providers can make a significant impact on data security, time to insight, and cost reduction.
To learn more about Arena and to register for a live demo, visit zaloni.com/demo.