Data Lake Architecture: The Four Zones

September 26th, 2016

Data lakes make more sense when you think about the architecture in zones. Don’t miss this encore lecture from Ben Sharma, CEO and Co-Founder of Zaloni. Ben uses illustrations of a reference architecture to describe the concept of four zones for envisioning the data lake:

Transient landing zone
Raw zone
Trusted zone
Refined zone

By understanding the inputs, outputs, processes, and policies within each zone, you can take your implementation further, evaluating a holistic approach and rethinking the possibilities of build vs. buy for the future of your data lake management.



It’s almost always cheaper to buy your data management than to build your own. Learn about the ROI that Zaloni’s DataOps platform, Arena, can provide by modernizing your data lake architecture. Request your demo today!

Read the webinar transcription here:

At Zaloni, we provide software platforms that help our customers create managed and governed data lakes, and we also help them with the delivery and deployment of those data lake platforms into production. So a lot of what you’ll see today is based on our experience doing these things in production for customers in different verticals: financial services, healthcare, telecommunications, industrial, retail, and others. I’ll use a lot of examples of how data lakes are being built and deployed. And then we’ll also look at some of the emerging areas where we see a lot of engagement right now with customers, where they’re taking a cloud-first approach, so they are actually starting to build these data lake platforms in the cloud, or they have a hybrid approach, where they have some on-prem implementations and they’re starting to build out cloud-based implementations as well. So what are some of the things you need to think about from an architecture and design standpoint as you go down that path? There is a lot of excitement around data lakes. But what are some of the reasons for that excitement? When the CIO or CTO of an organization looks at this technology on behalf of the chief analytics officer or the chief data officer for various use cases, it is ultimately about creating an agile data platform.

Right, so the traditional way of doing things took too long and was too expensive. Patterns are changing: you need to be able to rapidly onboard data and drive insights out of that data. So agility is one of the fundamental drivers, I would say, behind customers building and deploying these data lakes. Also, historically, especially in verticals like telecom and financial services, there is already so much data that they have been collecting over the years, but they have had to trim down that data periodically. So how can you now bring in all that data stored historically over a period of time and generate some of the new insights that you didn’t have before? And as you build these systems, how can you bring in new types of data, whether it is data from external environments or third-party providers, so that you can create a scalable data platform that provides these insights in a shorter time?

Before we go into the architecture details, let’s define at a high level what we consider a data lake. One of the fundamental capabilities, underscored by scale-out technologies like Hadoop, is a schema-less environment where you’re able to bring in lots of different types of data and store that data for an extended period of time. You are then able to take it further, as part of a maturity process, from the raw data sets that you’re bringing in from various sources into what we consider trusted data sets (I’ll talk about that in a lot more detail), and then create refined data sets for specific use cases. At the same time, you can provide a discovery sandbox where your consumers can do ad hoc, exploratory analytics on that data. The other pattern, as you’re creating the data lake, is being able to store that data for an extended period of time. So you need to think about how you retain that data, what constraints you may have around storage or cost, and then how you query the data.

So, earlier, we were familiar with relational databases, where the query patterns were strictly relational: you used SQL to query the data. That’s no longer the case with data lakes. Relational or SQL-based access is one of the options, but it’s not the only option; you’re able to write a lot of logic where you iteratively go over that data and generate insights based on machine learning and other algorithms that run natively on the data lake. One other pattern we see is that data lakes are evolving to converge not just batch use cases but also continuous streams of data being processed in the same converged infrastructure. So what are some of the architectural considerations you need to think about as you build such environments? That’s what we’ll cover. But the ultimate goal is to shrink your time to insight, so that from the time the data lands to the time you’re able to generate the insight, you’re not going through an extensive set of processes or delays. To do that, you need proper data management and governance, where you’re creating reusable data pipelines that you can leverage in the data lake. Alright, so that’s the promise. So what are some of the changing patterns? I’m sure a lot of you have been working on relational platforms, so you’re familiar with the traditional way of doing things: you have the source systems in an enterprise, you go through an extensive ETL process where you define what data sets need to be created as you onboard that data into an enterprise data warehouse, and then you create data marts out of that environment, serving the data sets as cubes or similar structures to your downstream use cases.
So what’s changing is that you’re now including the data lake in this enterprise architecture, and it is becoming the cornerstone of the next-generation data platform. You’re now able to onboard the data from many different systems into the data lake, and you have this concept of zones, which we’ll talk about in detail. From there, you’re able to serve a lot of new use cases directly out of the data lake, but you also use the data lake to do the heavy lifting to serve some of the legacy applications and use cases that may still be served out of a data warehouse. So keep this in mind. But in doing so, we see challenges, challenges that I’ve tried to group into three different areas.

So we see customers struggling with how to build the data lake: what set of technologies do I use? The ecosystem is changing very rapidly; there are new projects now that weren’t covered a year or two back. There’s a skills gap, obviously, and that’s why we’re here, to learn about new technologies. And there is inherent complexity in putting these systems in place: you have to think all the way from the infrastructure layer to the platform layer to the serving layer, with proper data management and governance in place. Similarly, we see challenges in managing these data lakes in terms of having a repeatable process for ingestion and having visibility into what data is coming in, and obviously security, privacy, and compliance are paramount in these environments. Last, one of the key things to consider as you think about the data lake is how you get value out of it. This is where engaging with your business and delivering value from the data lake comes into play: you need to enable self-service capabilities, reduce the reliance on IT, and provide reusable patterns out of the data lake.

So this is a reference architecture that we use as we build data lake platforms in production. On the left-hand side, what you see are the different sources of data, and then, in the main box, what we’re trying to do is carve out different zones in the data lake. You start with a transient landing zone, which may or may not be needed depending on your use case: in some use cases you may need a transient landing zone, but in others you may go directly to the raw zone. Then you have a raw zone, where you’re storing your data in its raw, original format for an extended period of time; while you’re storing this data, some of the sensitive attributes are masked and tokenized so that you’re making the data consumable based on compliance policies. Then we think about the data lake in terms of a trusted zone, where you take the data through a standardization and validation process, and you certify certain data sets so that they can be consumed out of the data lake across the enterprise. I’ll go through each of these sections in detail so that you’ll see a deeper view of the considerations you need to have. Then we also consider a refined zone, which is where you create use-case-specific derived data sets out of the trusted data sets you’re bringing in: these may be very LOB-specific, line-of-business-specific data sets that are created there. And we always see a need for a sandbox area, where you’re able to consume data from any of the other zones and also bring in your own data, to do ad hoc, exploratory use cases in a less governed way. There could then be a promotion process of taking some of the results and putting them back into the data lake. So let’s look at each one of them in a little more detail. What does the transient landing zone look like?
So, from our point of view, the transient landing zone is a temporary area for landing the data from source systems as it comes in. Typically you have limited access to the transient landing zone, because it’s not a consumable area of the data lake. But we see a need for it in highly regulated industries, where there are compliance and regulatory requirements that need to be met before the data is made consumable. This is where you apply certain policies to tokenize and mask sensitive attributes before you make them available in the raw zone. So basically what I’ve done is categorize the zones by inputs, outputs, processes, and policies. The inputs are the various types of data that you’re bringing in; the output of this zone goes to the raw data zone. This is, from our point of view, an optional zone in the data lake. And these are some of the processes you need to perform in the transient landing zone: you’re creating an intake process that is repeatable, you’re discovering some of the metadata as the data comes in, you’re registering the data in the catalog, you’re applying zone-specific policies that are defined on the right-hand side, and you’re capturing some of the operational metrics and starting to do post-ingestion validation. So what are some of the policies you need to think about? You need to think about your security policies; you need to think about data privacy, where you mask and tokenize certain attributes. You may need to think about quality at a coarse-grained level, where you check that the files came in with the integrity you’re expecting: not at the record level or the field level, but maybe at the file level. And then you may think about data lifecycle policies, where this data is short-lived; it is just there so that you can populate the raw zone. As soon as you populate the raw zone, you can remove the data from this zone, and you’re done with it.
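To make the masking and tokenization step concrete, here is a minimal Python sketch of a landing-zone privacy policy. The field names, salt, and hashing scheme are illustrative assumptions for this example, not a specific product’s policy engine:

```python
import hashlib

# Hypothetical policy: fields to tokenize before data leaves the transient
# landing zone. Names and salt are illustrative only.
SENSITIVE_FIELDS = {"ssn", "email"}
SALT = "per-environment-secret"

def tokenize(value: str) -> str:
    """Replace a sensitive value with a deterministic, irreversible token."""
    return hashlib.sha256((SALT + value).encode()).hexdigest()[:16]

def apply_privacy_policy(record: dict) -> dict:
    """Return a copy of the record that is safe to land in the raw zone."""
    return {
        k: tokenize(v) if k in SENSITIVE_FIELDS else v
        for k, v in record.items()
    }

landed = {"customer_id": "42", "email": "a@example.com", "plan": "gold"}
raw_ready = apply_privacy_policy(landed)
```

Because the token is deterministic, the same customer still correlates across data sets downstream, while the raw value never leaves the landing zone.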
So then, moving into the next zone of the data lake, we consider the raw data zone, which is the large data store, if you will, of all the original data in its original format, but with proper masking and tokenization. The input can come from the transient landing zone; the output goes into the trusted zone or the sandbox area. And these are some of the policies you should think about, in terms of masking, tokenization, and user access, maybe role-based access control. Here data quality becomes quite important: you may be doing profiling and entity-level and field-level checks of the data that has come in. And there are some transformations you may be thinking about as you create this single-view-of-truth data set in the raw zone for downstream access. Then there is data lifecycle management: we see an increasing number of customers starting to use cloud-based storage layers, like S3 and other object stores, so you could keep your raw data in those environments as you progress through the different phases of the data lake.

Next, let’s consider the trusted zone. This is where, based on the data stewards and SMEs that may be part of a chief data officer’s organization, you have validated and are now providing a set of data sets to be used widely across the organization. The input comes from the raw zone; the output from the trusted zone can be used by the refined zones. The processes you perform here are where you apply zone-specific policies and register in the catalog. And then you may apply LOB-specific transformations: let’s say you are the marketing team and you’re bringing in all these marketing data sets; you’re now creating a customer-360 view of the data. This is a trusted data set that anybody in the organization can now use, because you have done some of the correlations and things like that.
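The customer-360 idea can be sketched as a simple correlation of two raw sources on a shared key. The source names and fields below are hypothetical, chosen only to illustrate the shape of a trusted-zone derivation:

```python
def build_customer_360(crm_records, web_records):
    """Sketch of a trusted-zone 'customer 360' data set: correlate records
    from two raw sources on a shared customer_id. Names are illustrative."""
    by_id = {r["customer_id"]: dict(r) for r in crm_records}
    for w in web_records:
        by_id.setdefault(w["customer_id"], {"customer_id": w["customer_id"]})
        by_id[w["customer_id"]].update(w)
    return sorted(by_id.values(), key=lambda r: r["customer_id"])

crm = [{"customer_id": "1", "name": "Ada"}]
web = [{"customer_id": "1", "last_visit": "2016-09-01"},
       {"customer_id": "2", "last_visit": "2016-09-20"}]
cust360 = build_customer_360(crm, web)
```

In production this correlation would run on the cluster rather than in memory, but the certification idea is the same: one validated, widely consumable view built once, instead of every team re-joining raw sources.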
And again, from a policy standpoint, you need to apply various security policies and data lifecycle management. Then you go into refined zones, and this is where you get into very use-case-specific definitions of the data: you take the data from the trusted zone and create use-case-specific derived data sets. Let’s say you’re doing specific transformations where you’re creating aggregates, or creating denormalized data sets to be used by your serving layer, from a reporting standpoint and so on; and then you’re applying different policies in terms of user access and the lifetime of the data, based on the use case. And last but not least is the discovery sandbox, where you may be bringing in data from the various zones, including raw, trusted, and refined, and you may also be bringing in some of your own data: say you’re pulling in data from the web, or some third-party data sets. You then process the data and generate analytical models, or insights, out of it. Optionally, some of the results can be sent back to the raw zone, so that they follow the same process again. You’d apply certain policies to this data in terms of who can use it, and typically you define a lifecycle for this data so that you don’t keep it around for an extended period of time, unless it gets operationalized in the data lake.
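As a rough illustration of a refined-zone derivation, the sketch below aggregates a trusted order data set into per-customer monthly revenue, the kind of denormalized, serving-ready output described above. All data set and field names are made up for the example:

```python
from collections import defaultdict

def refine_monthly_revenue(trusted_orders):
    """Derive a use-case-specific aggregate (revenue per customer per month)
    from a trusted-zone data set. Field names are illustrative."""
    totals = defaultdict(float)
    for order in trusted_orders:
        totals[(order["customer_id"], order["month"])] += order["amount"]
    return [
        {"customer_id": c, "month": m, "revenue": v}
        for (c, m), v in sorted(totals.items())
    ]

trusted = [
    {"customer_id": "a", "month": "2016-09", "amount": 10.0},
    {"customer_id": "a", "month": "2016-09", "amount": 5.0},
    {"customer_id": "b", "month": "2016-09", "amount": 7.5},
]
refined = refine_monthly_revenue(trusted)
```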

So that’s the high-level view of the different zones in the data lake. This is how we think about it: from a source-system ingestion standpoint, you need different ways to ingest the data, and then you need various foundational capabilities in terms of metadata management, data quality, cataloging, and security and governance. Alright, so I’ll change gears a little bit. Now you need to think about how you holistically build this data lake, and we try to map that into three main categories. One is enabling the data lake. Next is governing the data lake. And then, how do you engage the business? In enabling the data lake, you need to put together platforms or frameworks for managed ingestion and metadata management, because that’s vital for creating the foundational layer that lets you leverage your data in the various ways we talked about. There are various attributes here: you need to be able to ingest different types of data, whether batch or streaming, and to map it to a repeatable pipeline, so that you’re able to monitor how much data came in and when, what the quality of the data was, and things like that. Similarly, once you enable the data lake, you need to think about governing it, and this is where you need to capture metadata to provide lineage, both as the data enters the data lake and as it gets transformed within the data lake. A lot of times we also see requirements where customers are bringing in source data from external systems, and they may also want to feed the data lake the lineage information that came from those systems, so that you have an end-to-end view of the lineage. So you need to think about how you would integrate not just with the data that is coming in from source systems but also with the metadata that is coming in from those systems. That’s important.
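One minimal way to picture lineage capture is an append-only log keyed by the output data set, recording which inputs and which operation produced it. The structure below is an assumption for illustration only, not a particular metadata product’s model:

```python
import time

def record_lineage(catalog, output, inputs, operation):
    """Append a lineage entry: which input data sets and which operation
    produced the given output. (Illustrative structure, not a real API.)"""
    catalog.setdefault(output, []).append({
        "inputs": list(inputs),
        "operation": operation,
        "at": time.time(),
    })

catalog = {}
record_lineage(catalog, "trusted.customers",
               ["raw.crm", "raw.web"], "standardize+dedupe")
record_lineage(catalog, "refined.cust360",
               ["trusted.customers"], "join")
```

Walking these entries backwards from any refined data set reconstructs the end-to-end lineage view described above.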

And then, thinking about data quality: I think this is an often-ignored topic in these discussions, and we see it as vital for building production-grade data platforms. You need to put a lot of emphasis on how you ensure data quality, and how you do it from a business-centric view, so that you don’t have an IT shop go in and develop a bunch of code every time you need to onboard a new data set. So you need to think about metadata-based data quality definitions, based on rules that you can then enforce on the data as it is being onboarded into the data lake, and about capturing metrics as you ingest the data so that you can decide whether to run your downstream pipeline on this data or not, because you may be getting some junk data, and you may not want to do further processing on it. Being able to separate good records from bad records, and to automate that whole pipeline, is very important and something to think about.
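A metadata-driven quality check of this kind might be sketched as rules defined as data, with ingestion splitting good records from bad ones so the downstream pipeline can decide whether to run. The specific rules and field names here are hypothetical:

```python
# Hypothetical metadata-driven rules: defined as data, not code, so that
# onboarding a new data set means adding rules, not writing a new pipeline.
RULES = {
    "age": lambda v: v is not None and 0 <= v <= 130,
    "email": lambda v: v is not None and "@" in v,
}

def split_records(records, rules):
    """Separate good records from bad ones; the bad-record count gives the
    failure-rate metric used to gate downstream processing."""
    good, bad = [], []
    for r in records:
        ok = all(check(r.get(field)) for field, check in rules.items())
        (good if ok else bad).append(r)
    return good, bad

batch = [
    {"age": 34, "email": "x@example.com"},
    {"age": -1, "email": "x@example.com"},
    {"age": 20, "email": "not-an-email"},
]
good, bad = split_records(batch, RULES)
```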

So, moving along this circle, next is data security and privacy. We all understand this is a key area where you need to spend time: you’re not just providing policies at the data access layer on who can access the data and what they can do with it, but you’re also defining role-based access control and entitlements at the data management layer, so it’s important to make sure you have an end-to-end security and governance story there. One other key area where we’re seeing more and more requirements is that, as these data lakes grow and customers are able to bring in a lot of data volume, how do you provide a cost-effective way of keeping this data over a period of time? That is where you need to think about hot, warm, and cold areas in the data lake. Even within the zones I defined, you need to think about how to keep your hot, warm, and cold data in different storage tiers, because you can take advantage of some of the storage technologies that are out there; and about how to do that at the metadata level, at the business metadata level, so that you can define policies and then enforce those policies on the data. And last but not least, you need to bring your end users, the consumers, to the data lake, and this is where a data catalog is vital, along with capabilities for self-service data preparation, where you can wrangle the data and feed the output of the wrangling process into an operational layer, so that you can automate it quickly, capturing the metadata and enforcing the governance models. All right, so those are the keys to what we think about in a data lake architecture.
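The hot/warm/cold idea can be expressed as a small, metadata-level policy that assigns a storage tier from a data set’s age. The thresholds below are made-up examples; real policies would key off business metadata such as access frequency or retention requirements:

```python
from datetime import date, timedelta

# Hypothetical lifecycle policy: (tier, max age in days). Data older than
# every threshold falls through to cold storage.
TIER_POLICY = [("hot", 30), ("warm", 365)]

def storage_tier(last_accessed: date, today: date) -> str:
    """Pick the storage tier for a data set based on how recently it was used."""
    age = (today - last_accessed).days
    for tier, max_age in TIER_POLICY:
        if age <= max_age:
            return tier
    return "cold"

today = date(2016, 9, 26)
tier_recent = storage_tier(today - timedelta(days=5), today)
tier_old = storage_tier(today - timedelta(days=400), today)
```

Enforcing such a policy then becomes a mechanical job (move files between tiers) driven by metadata, rather than ad hoc decisions per data set.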

We are seeing an emergence of cloud-based platforms, some of them cloud-only or cloud-first, and also hybrid cloud approaches. So you need to think about how you provide data privacy and security across those different environments, and how you take advantage of cloud-native features, so that you’re not just trying to lift and shift an existing on-prem Hadoop cluster or data lake environment into the cloud, but are actually taking advantage of the elasticity and the cost-effective ways of managing data in the cloud. And while doing so, how do you keep a consistent data management and data governance layer? These are some of the things to consider; I won’t go through all of them. But keep in mind that you may need a cloud-agnostic strategy at times, because depending on your use case you may need to deploy in multiple cloud environments to support different geographies. For example, we have a customer where data that is created in a given country has to stay in that country. Now you need to find a cloud provider who has a presence in that country, so that you are able to support those types of use cases. So you may not be able to stay with just one cloud provider; you may need to think about multi-cloud, and how you enable those kinds of use cases. So I have put together a kind of cloud data lake maturity model. First, we see folks who have on-prem clusters and are just trying to move them into the cloud, using the infrastructure-as-a-service layer: they may just do a lift and shift to get off of their own infrastructure or hardware onto a cloud provider’s hardware. But we also see some greenfield applications start with cloud-native features, taking advantage of the elasticity and of some of the optimized storage platforms in the cloud.
And you can gradually migrate from a lift-and-shift approach to cloud-native features. But as you grow in this maturity model, you need to think about hybrid cloud and multi-cloud. If you have different cloud service providers, or you have both on-prem and cloud, how do you provide a consistent layer of data management and data governance, so that your applications can be portable across these different environments? Containerization and other things help us at the application layer, but you also need to think about the data layer, and make it generic enough, with an abstraction, so that you can enable the use cases. All right, so how do you get started? This is a blueprint that we use. First of all, make sure there is business value in what you’re trying to do.

Right. So what are the business drivers, and what are the business questions that you’re trying to answer? Then think about the data sets that are needed to solve those business use cases. What are the concrete use cases that you can define, and what does the platform look like? Then create a roadmap based on that, create the managed data lake platform, and define how it maps into your long-term analytics strategy as you become a data-driven organization. Timeline-wise, these are some of the numbers that we have seen, but it varies depending on how fast or slow your various processes may be. Being able to do some quick POCs, then stand up a data lake platform, and then start building the use cases on top of it, is a natural progression along this journey.