September 26th, 2016
Data lakes make more sense when you think about the architecture in zones. Don’t miss this encore lecture from Ben Sharma, CEO and Co-Founder of Zaloni. Ben uses illustrations of a reference architecture to describe the concept of 4 zones for envisioning the data lake:
Transient landing zone
Raw zone
Trusted zone
Refined zone
By understanding the inputs, outputs, processes, and policies within each zone, you can take your implementation further, evaluating a holistic approach and rethinking the possibilities when it comes to build vs buy for the future of your data lake management.
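To make the zone idea concrete, here is a minimal Python sketch of how the four zones might map onto storage paths. The directory conventions, bucket name, and dataset names are illustrative assumptions, not a prescribed layout:

```python
# Hypothetical zone layout for a data lake rooted at a base path.
# Zone names follow the four-zone model described above; the exact
# directory conventions here are an assumption for illustration.
ZONES = ["transient", "raw", "trusted", "refined"]

def zone_path(base, zone, source, dataset):
    """Build a storage path that encodes zone, source system, and dataset."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return f"{base}/{zone}/{source}/{dataset}"

# Example: the same dataset as it is promoted through the zones.
paths = [zone_path("s3://lake", z, "billing", "invoices") for z in ZONES]
```

Encoding the zone in the path is one common convention; the same idea also works with separate buckets or containers per zone.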
It’s almost always the case that it’s cheaper to buy your data management than it is to build your own. Learn about the ROI that Zaloni’s DataOps platform, Arena, can provide by modernizing your data lake architecture. Request your demo today!
Read the webinar transcription here:
At Zaloni, we bring in software platforms that help our customers create managed and governed data lakes, and we also help them with the delivery and deployment of those data lake platforms into production. So a lot of what you’ll see today is based on our experience doing these things in production for customers in different verticals: financial services, healthcare, telecommunications, industrial, retail, and others. I’ll use a lot of examples of how data lakes are being built and deployed. We’ll also look at some of the emerging areas where we see a lot of engagement right now with customers who are taking a cloud-first approach, so they are actually starting to build these data lake platforms in the cloud, or they have a hybrid approach where they have some on-prem implementations and are starting to build out cloud-based implementations as well. So what are some of the things you need to think about from an architecture and design standpoint as you go down that path? There is a lot of excitement around data lakes. What are some of the reasons for that excitement? When the CIO or CTO of an organization is looking at this technology on behalf of chief analytics officers or chief data officers for various use cases, it is ultimately about creating an agile data platform.
Right, so the traditional way of doing things took too long and was too expensive. Patterns are changing: you need to be able to rapidly onboard data and drive insights out of that data. So agility is one of the fundamental drivers where we see customers building and deploying these data lakes. Also, historically, especially in verticals like telecom and financial services, there is already so much data that they have been collecting over the years, but they have had to trim down the data periodically. So how can you now bring in all that data stored over a period of time and generate some of the new insights that you didn’t have before? And as you build these systems, how can you bring in new types of data, whether it is data from external environments or third-party providers, so that you can create a scalable data platform that provides these insights in a shorter time? Before we go into the architecture details, let’s define at a high level what we consider a data lake. One of the fundamental capabilities, underscored by scalable technologies like Hadoop, is creating a schema-less environment where you’re able to bring in lots of different types of data and store that data for an extended period of time. You are then able to take it further, as part of a maturity process, from the raw data sets that you’re bringing in from various sources into what we consider trusted data sets (and I’ll talk about that in a lot more detail), and then create refined data sets for specific use cases. At the same time, you can provide a discovery sandbox where your consumers can do ad hoc exploratory analytics on that data. The other pattern, as you’re creating the data lake, is storing that data for an extended period of time.
So you need to think about how you retain that data, what constraints you may have in terms of storage or cost, and then how you query the data.
Earlier, with relational databases, the query patterns were strictly relational: you used a relational structure to query the data. That’s no longer the case with data lakes. Relational, or SQL-based, access is one of the options, but it’s not the only option; you’re able to write a lot of logic where you iteratively go over that data and generate insights based on machine learning and other algorithms that run natively on the data lake. One other pattern we see is that data lakes are evolving to converge not just batch use cases but also continuous streams of data, processed in the same infrastructure. So what are some of the architectural considerations you need to think about as part of building such environments? That is what we’ll cover. The ultimate goal is to shrink your time to insight, so that from the time the data lands to the time you’re able to generate the insight, you’re not going through an extensive set of processes or delays. But you need proper data management and governance, where you’re creating reusable data pipelines that you can leverage in the data lake. Alright, so that’s the promise. What are some of the changing patterns? I’m sure a lot of you have been working on relational platforms, so you’re familiar with the traditional way of doing things: you have the source systems in an enterprise, and you go through an extensive ETL process where you define what data sets need to be created as you onboard that data into an enterprise data warehouse. Then you create data marts out of that environment, serving data sets as cubes or whatever you provide to your downstream use cases.
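To illustrate the point that SQL-based access is one option but not the only one, here is a small self-contained Python sketch that answers the same question both ways. An in-memory SQLite table stands in for lake data, and the dataset is invented for illustration:

```python
import sqlite3

# Tiny invented dataset standing in for data in the lake.
rows = [("alice", 120), ("bob", 80), ("carol", 200)]

# Option 1: SQL-based access, as you would get from a SQL-on-Hadoop engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE usage (user TEXT, minutes INTEGER)")
conn.executemany("INSERT INTO usage VALUES (?, ?)", rows)
heavy_sql = [u for (u,) in conn.execute(
    "SELECT user FROM usage WHERE minutes > 100 ORDER BY user")]

# Option 2: programmatic access, iterating over the same records.
# This is where custom logic or ML scoring would run natively on the data.
heavy_iter = sorted(u for u, minutes in rows if minutes > 100)
```

Both paths produce the same answer; the point is that the lake supports either access pattern over the same stored data.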
So what’s changing is that now you’re including the data lake in this enterprise architecture, and it is becoming the cornerstone of the next-generation data platform. You’re now able to onboard the data from many different systems into the data lake, and you have this concept of zones, which we’ll talk about in detail. From there, you’re able to serve a lot of new use cases directly out of the data lake, but you also use the data lake to do the heavy lifting to serve some of the legacy applications and use cases that may still be served out of a data warehouse. So keep this in mind. In doing so, we see challenges, which I’ve tried to group into three different areas.
We see customers struggling with how to build the data lake: what set of technologies do I use? The ecosystem is changing very rapidly; there are new frameworks and other things that weren’t covered a year or two back. There’s a skills gap, obviously, and that’s why we’re here, to learn about new technologies. And there is inherent complexity in putting these systems in place: you have to think all the way from the infrastructure layer to the platform layer to the serving layer, with proper data management and governance in place. Similarly, we see challenges in managing these data lakes: having a repeatable process for ingestion, having visibility into what data is coming in, and, obviously, security, privacy, and compliance are paramount in these environments. Last, but one of the key things to consider as you’re thinking about the data lake, is how you get value out of it. This is where engaging with your business and delivering value from the data lake comes into play: you need to enable self-service capabilities, reduce the reliance on IT, and provide reusable patterns out of the data lake.
So that’s the high-level view of the different zones in the data lake. This is how we think about it: from a source-system ingestion standpoint, you need different ways to ingest the data, and then you need various foundational capabilities in terms of metadata management, data quality, cataloging, and security and governance. Alright, I’ll change gears a little. Now you need to think about how you holistically build this data lake, and we map that into three main categories: one is enabling the data lake, next is governing the data lake, and then how you engage the business. In enabling the data lake, you need to put together platforms or frameworks for managed ingestion and metadata management, because that’s vital for creating the foundational layer that lets you leverage your data in the various ways we talked about. There are various attributes here: you need to be able to ingest different types of data, whether batch or streaming, and map it to a repeatable pipeline, so that you can monitor how much data came in and when, what the quality of the data was, and so on. Similarly, you need to think about governing the data lake once you enable it, and this is where you need to capture metadata to provide lineage, both as the data entered the data lake and as it got transformed within the data lake. A lot of times we also see requirements where customers are bringing in source data from external systems, and they may also want to feed the data lake the lineage information that came from those systems, so that you have an end-to-end view of the lineage. So you need to think about how you would integrate not just the data coming in from the source systems but also the metadata coming in from those systems. That’s important.
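As a rough sketch of the lineage-capture idea, the following Python fragment records each pipeline step’s inputs and outputs and walks them back to the original sources. The record format, step names, and dataset names are assumptions for illustration:

```python
from datetime import datetime, timezone

# Minimal lineage ledger: each transformation appends a record linking
# output datasets back to their inputs. Field names are assumptions.
lineage = []

def record_step(step, inputs, outputs):
    lineage.append({
        "step": step,
        "inputs": list(inputs),
        "outputs": list(outputs),
        "at": datetime.now(timezone.utc).isoformat(),
    })

record_step("ingest", ["src.billing.invoices"], ["raw.invoices"])
record_step("standardize", ["raw.invoices"], ["trusted.invoices"])

def upstream(dataset):
    """Walk the ledger backwards to find every upstream source of a dataset."""
    sources = set()
    for rec in reversed(lineage):
        if dataset in rec["outputs"]:
            for inp in rec["inputs"]:
                sources.add(inp)
                sources |= upstream(inp)
    return sources
```

The same ledger idea extends to lineage records supplied by external source systems, which is how you would stitch together the end-to-end view mentioned above.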
Then, data quality. I think this is an often ignored topic in these discussions, and we see it as vital for building production-grade data platforms. You need to put a lot of emphasis on how you ensure data quality, and how you do it from a business-centric view, so that you don’t have an IT shop go in and develop a bunch of code every time you need to onboard a new data set. You need to think about metadata-based data quality definitions, based on rules that you can then enforce on the data as it is being onboarded into the data lake, and you need to capture metrics as you ingest the data, so that you can decide whether you want to run your downstream pipeline on this data or not, because you may be getting junk data and you may not want to do further processing on it. So being able to separate good records from bad records, and to automate that whole pipeline, is very important and something to think about.
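Here is a minimal sketch of metadata-based, rule-driven quality checks that split good records from bad and emit metrics a pipeline could gate on. The rule format, field names, and sample records are illustrative assumptions:

```python
# Metadata-driven data quality: rules are declared as data, not code,
# so onboarding a new dataset means adding rules, not writing a new job.
rules = [
    ("customer_id", lambda v: v is not None and v != ""),
    ("age",         lambda v: isinstance(v, int) and 0 <= v < 130),
]

def apply_rules(records, rules):
    """Split records into good/bad and collect metrics for the pipeline."""
    good, bad = [], []
    for rec in records:
        if all(check(rec.get(field)) for field, check in rules):
            good.append(rec)
        else:
            bad.append(rec)
    metrics = {"total": len(records), "good": len(good), "bad": len(bad)}
    return good, bad, metrics

records = [
    {"customer_id": "c1", "age": 34},
    {"customer_id": "",   "age": 34},   # fails the customer_id rule
    {"customer_id": "c3", "age": 999},  # fails the age rule
]
good, bad, metrics = apply_rules(records, rules)
```

A downstream pipeline might proceed only if `metrics["bad"] / metrics["total"]` stays below a threshold, which is the automated gating decision described above.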
Moving along the circle: data security and privacy. We all understand this is a key area where you need to spend time. You’re not just providing policies at the data access layer on who can access the data and what they can do with it; you’re also defining role-based access control and entitlements at the data management layer, so it’s important to make sure you have an end-to-end security and governance story there. One other key area where we’re seeing more and more requirements: as these data lakes grow and customers bring in large volumes of data, how do you provide a cost-effective way of keeping this data over time? That is where you need to think about hot, warm, and cold areas in the data lake. Even in the zones I defined, you need to think about how to keep hot data, warm data, and cold data in different storage tiers, because you can take advantage of some of the storage technologies that are out there. And how do you do that at the metadata level, at the business metadata level, so that you can define policies and then enforce those policies on the data? Last but not least, you need to bring your end users, the consumers, into the data lake, and this is where a data catalog is vital, along with capabilities for self-service data preparation, where you can wrangle the data and feed the output of the wrangling process into an operational layer, so that you can automate it quickly, capturing the metadata and enforcing the governance models. All right, so those are the key things we think about for a data lake architecture.
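The hot/warm/cold idea can be sketched as an age-based policy defined as metadata rather than code. The thresholds, tier names, and dates below are illustrative assumptions:

```python
from datetime import date

# Hypothetical tiering policy, defined at the metadata level:
# thresholds (in days since last access) map data to storage tiers.
POLICY = [(30, "hot"), (365, "warm")]  # anything older falls to "cold"

def tier_for(last_accessed, today, policy=POLICY):
    """Return the storage tier a dataset should live in, per the policy."""
    age = (today - last_accessed).days
    for max_age, tier in policy:
        if age <= max_age:
            return tier
    return "cold"

today = date(2016, 9, 26)
examples = {
    "recent":  tier_for(date(2016, 9, 1), today),  # 25 days old
    "older":   tier_for(date(2016, 1, 1), today),  # about 9 months old
    "archive": tier_for(date(2014, 1, 1), today),  # years old
}
```

Because the policy is just data, an enforcement job can evaluate it periodically and move files between storage tiers without any per-dataset code.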
We are seeing an emergence of cloud-based platforms: some are cloud-only or cloud-first, and there is also a hybrid cloud approach. You need to think about how you provide data privacy and security across those different environments, and how you take advantage of cloud-native features, so that you’re not just trying to lift and shift an existing on-prem Hadoop cluster or data lake environment into the cloud, but actually taking advantage of the elasticity and the cost-effective ways of managing data in the cloud. And while doing so, how do you keep a consistent data management and data governance layer? These are some of the things to consider; I won’t go through all of them, but keep in mind that you may need a cloud-agnostic strategy at times. Depending on your use case, you may need to deploy in multiple cloud environments to support different geographies. For example, we have a customer where data that is created in a given country has to stay in that country. Now you need to find a cloud provider with a presence in that country, so that you can support those types of use cases. So you may not be able to stay with just one cloud provider; you may need to think about multi-cloud and how you enable those kinds of use cases. I have put together a kind of cloud data lake maturity model. First, we see folks who have on-prem clusters and are just trying to move them into the cloud using the infrastructure-as-a-service layer; they may do a lift and shift simply to get off of their own hardware onto the cloud provider’s hardware. But we also see greenfield applications start with cloud-native features, taking advantage of the elasticity and of some of the optimized storage platforms in the cloud.
And you can gradually migrate from a lift-and-shift approach to cloud-native features. But as you grow in this maturity model, you need to think about hybrid cloud and multi-cloud. If you have different cloud service providers, or you have on-prem and cloud, how do you provide a consistent layer of data management and data governance so that your applications can be portable across these different environments? Containerization and other things help at the application layer, but you also need to think about the data layer, making it generic enough, with an abstraction, so that you can enable the use cases. All right, so how do you get started? This is a blueprint that we use. First of all, make sure there is business value in what you’re trying to do.
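The earlier point about making the data layer generic enough, with an abstraction, so that pipelines stay portable across clouds can be sketched as a small storage interface. The interface and the in-memory backend below are hypothetical, standing in for real S3, ADLS, GCS, or HDFS backends:

```python
from abc import ABC, abstractmethod

class LakeStorage(ABC):
    """Application code targets this interface; per-cloud backends hide
    provider-specific details behind it."""
    @abstractmethod
    def write(self, path: str, data: bytes) -> None: ...
    @abstractmethod
    def read(self, path: str) -> bytes: ...

class InMemoryStorage(LakeStorage):
    """Stand-in backend; a real one would wrap a cloud object store."""
    def __init__(self):
        self._blobs = {}
    def write(self, path, data):
        self._blobs[path] = data
    def read(self, path):
        return self._blobs[path]

def promote(storage: LakeStorage, src: str, dst: str) -> None:
    """Pipeline logic written once against the abstraction, portable
    across whichever backend is plugged in."""
    storage.write(dst, storage.read(src))

store = InMemoryStorage()
store.write("raw/invoices", b"...")
promote(store, "raw/invoices", "trusted/invoices")
```

Swapping the backend then changes where the data lives without touching the pipeline code, which is the portability property the maturity model calls for.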
Right. What are the business drivers, and what are the business questions you’re trying to answer? Then think about the data sets needed to solve those business use cases. What are the concrete use cases you can define? What does the platform look like? Then create a roadmap based on that, create the managed data lake platform, and define how it maps into your long-term analytics strategy as you become a data-driven organization. Timeline-wise, these are some of the numbers we have seen, but it varies depending on how fast or slow your various processes may be. Being able to do some quick POCs, then stand up a data lake platform, and then start building the use cases on top of it is a natural progression along this journey.