July 11th, 2018
Data lake cloud migration has a number of significant benefits, including cost-effectiveness and agility. However, to see these benefits, it’s important to understand how to structure your data lake architecture in the cloud, which is a bit different from a traditional on-premises architecture. Also, moving to a cloud-based data lake or multi-cloud environment can’t (or really shouldn’t) happen all at once – it’s a journey that happens over time. Let’s explore some key benefits as well as the steps you need to consider to achieve a modern data architecture in the cloud.
The beauty of the cloud is its agility and flexibility. The cloud makes it possible to pay for just the compute you use. For example, you can start with a 20-node cluster and then easily scale up to 100 nodes as your requirements change. You can also scale down as needed. You can even pay for only a specific window of time; for example, if you need compute for two hours to run a batch job, you pay for just those two hours.
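To make the pay-per-use model concrete, here’s a back-of-the-envelope sketch in Python. The hourly rate is a made-up illustrative number, not actual cloud pricing:

```python
# Back-of-the-envelope sketch of pay-per-use billing. The hourly rate
# below is a made-up illustrative number, not actual cloud pricing.

def compute_cost(nodes: int, hours: float, rate_per_node_hour: float) -> float:
    """You pay for nodes x hours x rate - and nothing while idle."""
    return nodes * hours * rate_per_node_hour

# A two-hour batch job on a 20-node cluster:
small_run = compute_cost(nodes=20, hours=2, rate_per_node_hour=0.25)   # 10.0

# The same two hours after scaling the cluster to 100 nodes:
big_run = compute_cost(nodes=100, hours=2, rate_per_node_hour=0.25)    # 50.0
```

Scaling back down is just as direct: the moment the cluster shrinks or is terminated, the meter stops.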
When it comes to storage and compute, the cloud is different from an on-premises data lake. With on-prem, whether your cluster runs Hortonworks, Cloudera or MapR, storage and compute live on the same nodes. In other words, if you have a 100-node cluster, it stores the data as well as performs the compute. In the cloud, you have separate storage and compute services. This is because in the cloud, storage is cheap and compute is expensive. This separation requires slightly different thinking on your part when it comes to your data lake architecture.
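A simple cost model shows why this separation matters. The rates below are assumptions chosen only to reflect the “storage is cheap, compute is expensive” dynamic, not real prices:

```python
# Illustrative cost model: all rates here are assumptions, not real
# pricing. The shape of the numbers reflects the point above: object
# storage is cheap per TB, while compute nodes are expensive per hour.

STORAGE_RATE_PER_TB_MONTH = 23.0    # assumed object-storage rate, $/TB-month
COMPUTE_RATE_PER_NODE_HOUR = 0.50   # assumed cluster-node rate, $/hour

def coupled_monthly_cost(nodes: int, hours_per_month: float = 730.0) -> float:
    """On-prem-style coupled model: the nodes hold the data, so the
    whole cluster must stay up all month even when no jobs are running."""
    return nodes * hours_per_month * COMPUTE_RATE_PER_NODE_HOUR

def decoupled_monthly_cost(data_tb: float, nodes: int,
                           compute_hours: float) -> float:
    """Cloud model: data lives in cheap object storage; compute nodes
    run only for the hours that jobs actually need them."""
    storage = data_tb * STORAGE_RATE_PER_TB_MONTH
    compute = nodes * compute_hours * COMPUTE_RATE_PER_NODE_HOUR
    return storage + compute

# 100 TB of data, a 100-node cluster, but only ~60 hours of real work a month:
always_on = coupled_monthly_cost(nodes=100)                     # 36500.0
pay_per_use = decoupled_monthly_cost(data_tb=100, nodes=100,
                                     compute_hours=60)          # 5300.0
```

The gap is the whole argument for decoupling: when data and processing scale independently, you stop paying compute prices just to keep storage online.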
In addition to on-demand processing, you get on-demand infrastructure with a cloud migration. You have the ability to start small, grow as needed and, if you encounter a scenario where you need to cut back, it’s easy to make that happen.
Upgrade and refresh cycles can be long because there are so many dependencies and aspects that have to be planned out, including infrastructure, operations, and software. However, many cloud providers are adding services from vendors that make it easier to upgrade without impacting your overall solution. For example, we have clients with cloud-based data lake architectures that were able to upgrade to a new version of Hadoop in a matter of days.
Security and privacy in the cloud have always been a concern for enterprises. Our clients ask, is the cloud really secure? Can we trust and share data in the cloud? The move of financial and healthcare companies to the cloud has pushed cloud technology vendors to achieve certifications for security and privacy over the years. Today you’ll find that most compliance and regulatory requirements are already baked in and provided by the big cloud vendors.
A big advantage of moving to the cloud is that cloud vendors have regional, cross-regional or cross-country disaster recovery strategies and applications in place. This means you don’t have to maintain another data center to ensure resiliency in case of disaster.
Moving the data lake to the cloud is not an overnight process. Nor should it be, as every business has unique challenges. We see most enterprises move through four main phases of the journey from a traditional on-prem architecture to a modern architecture that leverages the cloud: greenfield, hybrid, full cloud, and multi-cloud. In the initial greenfield phase, a smart way to start is with a small set of use cases in a particular line of business and move that infrastructure to the cloud. Once you demonstrate success with a handful of use cases, you can use this to get additional buy-in from management to move to the next phase.
In the second “hybrid” phase, you have a percentage of your data in the cloud and the rest on-prem. You’ll need to spend a good amount of time determining your strategy, getting your team up to speed, and putting processes in place to manage this new environment. You will also need to build integration between the on-prem platform and the cloud in order to satisfy all user and client requirements.
In the third phase, you apply your learnings from the hybrid experience to get your data lake fully into the cloud and reap the full benefits. However, even at this phase your journey isn’t static, as technology is always changing and innovating. For example, when we started the AWS journey for some of our customers, transient clusters were the main cost-saving option; now Amazon offers Spot Instances, which enable even more cost-effective processing than on-demand instances.
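Here’s a hedged sketch of why Spot Instances can beat on-demand pricing for batch workloads. The prices and interruption probability below are hypothetical; real Spot prices fluctuate with supply and demand, and Spot capacity can be reclaimed mid-job, so interruption-tolerant workloads benefit most:

```python
# Hypothetical prices, for illustration only: real Spot prices
# fluctuate with supply and demand, and Spot capacity can be
# reclaimed mid-job, so a fair comparison prices in interruptions.

ON_DEMAND_RATE = 0.40   # assumed on-demand price, $/node-hour
SPOT_RATE = 0.12        # assumed Spot price, $/node-hour

def job_cost(nodes: int, hours: float, rate: float) -> float:
    """Cost of a batch job at a flat per-node-hour rate."""
    return nodes * hours * rate

def expected_spot_cost(base_cost: float, interruption_prob: float,
                       rerun_fraction: float = 0.5) -> float:
    """Very simplified expectation: with some probability the job is
    interrupted and a fraction of the work has to be re-run."""
    return base_cost * (1 + interruption_prob * rerun_fraction)

on_demand = job_cost(nodes=50, hours=4, rate=ON_DEMAND_RATE)
spot = expected_spot_cost(job_cost(nodes=50, hours=4, rate=SPOT_RATE),
                          interruption_prob=0.10)
# Even after pricing in the risk of an interruption, Spot comes out
# well ahead in this toy model.
```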
The final phase is a multi-cloud environment. In this phase, you have to think about how to make your platform agnostic to enable the movement of data between cloud providers and technologies, such as between AWS, Azure and Google. In addition, you’ll want to consider how to containerize all your applications so that you can spin them up and down in a matter of minutes. We’ve been working with partners who are making multi-cloud much easier, with automation to stand up clusters and bring them down again in specific regions as needed.
For decades, most enterprises have had traditional data architectures for structured data, and this has served us well for quite some time. However, as data volumes grow and data types change, this architecture is no longer the most effective, or even feasible, when it comes to streaming, unstructured or IoT data. The cloud is essential for a modern data environment, and enterprises should be strategizing how to transition from a traditional data lake architecture to a cloud-based architecture. Now is the time to put the roadmap in place for implementation and cloud migration based on your enterprise’s needs and how fast you need to move to stay competitive in your industry.