Blogs

8 Key Considerations for Migrating your Data Lake to the Cloud

Team Zaloni | November 14th, 2018

When it comes to cloud-based data lakes, we find that most enterprises are still in the planning phase or running a hybrid environment. For companies considering how to make the move, we like to emphasize that it is not a “lift and shift” process where you simply relocate your existing infrastructure to the cloud. Instead, moving to the cloud is a longer-term journey that requires a different architectural approach. To help companies make the transition successfully, below are some recommendations for addressing common challenges, along with the considerations that should inform your cloud strategy.

1. Pick and choose your use cases

Focusing your initial efforts on a few use cases that can clearly benefit from the cloud is a smart way to get started – it shows value to the business and demonstrates your forward-looking strategy in action. For example, one of our clients decided to offload data to the cloud for a greenfield data science project. In just a few weeks, they had moved datasets onto Amazon S3, which made it much easier to connect to their Amazon Virtual Private Cloud (VPC) and for their vendors to share data securely. The cloud gave the team the compute power they required, and they were able to show results quickly.
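To give a sense of how lightweight that initial offload can be, here is a minimal sketch of copying a local dataset into S3 with boto3; the bucket, prefix, and file names are hypothetical placeholders.

```python
# Minimal sketch of offloading a local dataset to S3 with boto3.
# The bucket name, key prefix, and file path are hypothetical placeholders.
import boto3

s3 = boto3.client("s3")

def offload_dataset(local_path: str, bucket: str, key: str) -> None:
    """Upload one file to S3; multipart transfer is handled automatically."""
    s3.upload_file(local_path, bucket, key)

offload_dataset("exports/claims_2018.parquet",
                "example-datalake-raw",            # hypothetical bucket
                "claims/2018/claims_2018.parquet")
```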

2. Determine your cloud strategy

Today when we talk about “the cloud,” it can mean a single-cloud or a multi-cloud model. Many of our clients want the flexibility to work with multiple cloud technologies, such as Azure, AWS and Google Cloud. A multi-cloud environment can provide more flexibility and let you fit the best solution to each project and workload. However, you’ll still have to determine what size cluster you need to meet SLA commitments: even though clusters in the cloud can be dynamic, you must select a cluster size that meets the requirements of specific jobs. These are strategic decisions that should be made at the chief architect level.
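To make the sizing question concrete, here is a back-of-the-envelope sketch; the data volume, per-node throughput, and SLA window are hypothetical assumptions you would replace with your own measurements.

```python
# Back-of-the-envelope cluster sizing. All figures below are hypothetical
# assumptions for illustration, not vendor benchmarks.
data_gb = 2_000                 # nightly batch volume to process
gb_per_node_hour = 50           # assumed per-node processing throughput
sla_hours = 4                   # the job must finish within this window

node_hours_needed = data_gb / gb_per_node_hour        # 40 node-hours
nodes_needed = -(-node_hours_needed // sla_hours)     # ceiling -> 10 nodes

print(f"Provision at least {int(nodes_needed)} worker nodes "
      f"to finish {data_gb} GB within {sla_hours}h")
```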

3. Choose cloud-native technologies – or not

With a single-cloud environment, it typically makes sense to choose native components, such as Amazon EMR on AWS or HDInsight on Azure. The benefit is that these technologies are tightly integrated with the rest of the provider’s ecosystem, which can make your solution more effective, simpler and faster to develop. However, you also have to consider that this makes your application dependent on those services. It’s important to think this through and determine the right approach for your business.
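One common way to limit that dependency is to keep provider-specific calls behind a thin interface of your own, so that swapping clouds touches one class instead of every pipeline. The sketch below is a hypothetical illustration of that pattern; the ObjectStore and S3Store names are made up for this example.

```python
# Sketch of a thin storage abstraction that keeps cloud-native calls behind
# one interface. Class and method names here are hypothetical.
from abc import ABC, abstractmethod

class ObjectStore(ABC):
    @abstractmethod
    def read(self, path: str) -> bytes: ...
    @abstractmethod
    def write(self, path: str, data: bytes) -> None: ...

class S3Store(ObjectStore):
    def __init__(self, bucket: str):
        import boto3
        self._s3 = boto3.client("s3")
        self._bucket = bucket

    def read(self, path: str) -> bytes:
        return self._s3.get_object(Bucket=self._bucket, Key=path)["Body"].read()

    def write(self, path: str, data: bytes) -> None:
        self._s3.put_object(Bucket=self._bucket, Key=path, Body=data)

# An AzureBlobStore implementing the same interface could be dropped in
# without changing pipeline code that depends only on ObjectStore.
```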

4. Consider interconnectivity

As part of your journey to the cloud or a hybrid environment, it’s critical to determine which systems need to connect with each other: for example, which data sources need to talk to the cloud, and which datasets need to move from the cloud back on-prem. Cost is one lens for these integration requirements – although storage may be cheaper in the cloud, transferring data between on-prem and cloud can be very expensive. Time is another: how long will the data transfer take?
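A quick back-of-the-envelope calculation can surface these costs early. In the sketch below, the egress price and link bandwidth are hypothetical assumptions; substitute your provider’s current rates and your measured throughput.

```python
# Estimate egress cost and transfer time for moving data back on-prem.
# The $/GB rate and link bandwidth are hypothetical assumptions; check
# your provider's current pricing and your actual measured throughput.
data_gb = 5_000
egress_rate_usd_per_gb = 0.09      # hypothetical cloud egress price
link_mbps = 1_000                  # hypothetical dedicated 1 Gbps link

cost_usd = data_gb * egress_rate_usd_per_gb
transfer_hours = (data_gb * 8_000) / link_mbps / 3_600   # GB -> megabits

print(f"Egress cost: ${cost_usd:,.2f}")
print(f"Transfer time at {link_mbps} Mbps: ~{transfer_hours:.1f} hours")
```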

5. Leverage transient capabilities

Instead of running a cluster 24/7 in Hadoop on-prem, the cloud enables transient clusters, which give you compute on demand. This pay-for-what-you-use model can significantly reduce overall costs, but it does bring some challenges with metadata management that need to be considered (more on that in the next section). AWS also offers Spot Instances, which can give you spare compute capacity at significant discounts.
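As an illustration, here is a minimal boto3 sketch of a transient EMR cluster that runs its steps and then terminates itself, with core nodes requested from the Spot market. The cluster name, instance types, S3 paths and step definition are hypothetical.

```python
# Sketch: launch a transient EMR cluster that shuts itself down when its
# steps finish, with core nodes requested from the Spot market.
import boto3

emr = boto3.client("emr")

response = emr.run_job_flow(
    Name="nightly-etl-transient",               # hypothetical name
    ReleaseLabel="emr-5.19.0",
    LogUri="s3://example-logs/emr/",            # hypothetical bucket
    Instances={
        "InstanceGroups": [
            {"Name": "master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "core", "InstanceRole": "CORE", "Market": "SPOT",
             "InstanceType": "m5.xlarge", "InstanceCount": 4},
        ],
        # False = terminate the cluster once all steps complete.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[{
        "Name": "example-spark-job",             # hypothetical step
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://example-code/job.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Launched cluster:", response["JobFlowId"])
```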

6. Apply a data management layer

In the cloud, where your storage and compute clusters are separate, a management layer is essential for managing the metadata and business definitions that are sent to the cluster. With transient clusters this is particularly important, because metadata stored on the cluster is deleted when the cluster shuts down. By applying a data lake management platform, you can maintain the metadata as you would with a persistent cluster and understand what tables are in the data lake, where they are located and how they are structured.
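On AWS, for example, one way to keep table definitions alive across transient clusters is to point Hive at an external catalog such as the AWS Glue Data Catalog. The snippet below sketches that configuration as it could be passed to EMR; a self-managed external Hive metastore would serve the same purpose.

```python
# Sketch: EMR configuration that stores Hive table metadata in the AWS Glue
# Data Catalog, so table definitions survive transient-cluster shutdowns.
# This list would be passed as the Configurations= argument to run_job_flow.
glue_metastore_config = [{
    "Classification": "hive-site",
    "Properties": {
        "hive.metastore.client.factory.class":
            "com.amazonaws.glue.catalog.metastore."
            "AWSGlueDataCatalogHiveClientFactory"
    },
}]
```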

In addition, a data management platform is needed to track data lineage, particularly in a hybrid environment, where the data lifecycle can be very complex: data comes from various sources, travels to different clusters, and is combined and enriched along the way. Managing the data lifecycle and tracking lineage is very important for compliance – you need to be able to trust your data and understand where it comes from.

7. Review data security in the cloud

In an on-prem environment, it’s standard to set up Sentry or Ranger policies to control access to data. However, security standards differ from one cloud vendor to another, and it’s important to think through how you will maintain consistent data governance policies across a hybrid or multi-cloud environment – a data management platform can help with that. Fortunately, most of the large cloud vendors have addressed security and regulatory requirements, but it’s a good idea to understand the specifics of what they offer and how you’ll integrate with them for seamless protection. One thing to verify with your cloud vendor is that data is encrypted both at rest and in transit.
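As a minimal illustration on AWS, the sketch below writes an object with server-side KMS encryption at rest; boto3 communicates with S3 over HTTPS by default, which covers encryption in transit. The bucket, object key and KMS alias are hypothetical.

```python
# Sketch: write an object with server-side encryption at rest (SSE-KMS).
# boto3 talks to S3 over HTTPS by default, covering encryption in transit.
# Bucket, object key, and KMS key alias below are hypothetical.
import boto3

s3 = boto3.client("s3")
s3.put_object(
    Bucket="example-datalake-curated",
    Key="customers/2018/part-0000.parquet",
    Body=b"...data...",
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="alias/example-datalake-key",
)
```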

8. Augment your skill set

DevOps for an on-prem environment is different from DevOps in the cloud, and your team will need to develop or hire new skills. Although managing a cluster is easier in the cloud thanks to vendor services, your team still needs to know how to deploy code and how to start and stop a cluster, for example. Fortunately, there is plenty of information and training available through vendors and partners. However, training and hiring take time, and this needs to be factored into your overall strategy.
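As a small example of that skill shift, checking a cluster’s state and shutting it down on AWS is a short scripted operation once the team knows the tooling; here is a minimal sketch with boto3, using a hypothetical cluster ID.

```python
# Sketch of a routine cloud-DevOps task: check a cluster's state and
# terminate it if idle. The cluster ID is hypothetical.
import boto3

emr = boto3.client("emr")
cluster_id = "j-EXAMPLE123"   # hypothetical

state = emr.describe_cluster(ClusterId=cluster_id)["Cluster"]["Status"]["State"]
if state == "WAITING":        # idle, with no pending steps
    emr.terminate_job_flows(JobFlowIds=[cluster_id])
```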

Start your data lake journey

Moving your data lake to the cloud is not a quick process; it is a longer-term journey that requires strong leadership and a well-thought-out strategy. However, it can be accomplished in phases, making the transition a more realistic and successful endeavor, particularly if enterprises put the right technologies in place, including a data management platform. Zaloni’s DataOps platform, Arena, makes it 75% faster to build and scale a data lake and delivers radically higher ROI compared to starting from scratch – particularly for a hybrid environment. Our software and services can help you modernize your data architecture and streamline your digital transformation. Contact us to learn more.

About the author

This team of authors from Team Zaloni provides expertise, best practices, tips and tricks, and use cases across varied topics including data governance, data catalog, DataOps, observability, and much more.