March 15th, 2017
Learn how to build a modern, scalable data architecture to get business results.
When building your data stack, architecture could be your biggest challenge—yet it could also be the best predictor of success. With so many elements to consider and no proven playbook, where do you begin when assembling a scalable data architecture? Ben Sharma shares real-world lessons and best practices to get you started. If you are concerned with building a data architecture that will serve you now and scale for the future, this is a must-attend session.
• A recommended data lake reference architecture
• Considerations for data lake management and operations
• Considerations for data lake security and governance
• Metadata management
• Logical data lakes to enable ground-to-cloud hybrid architectures
• Self-service data marketplaces for more democratized data access
So you’ve built your own data lake; now you need to ensure it gets used. Zaloni Arena can help you build a modern data architecture for your enterprise. Get your custom demo today!
Read the webinar transcript here:
Morning everyone, I’m Ben Sharma, the founder and CEO of Zaloni. The topic of my presentation today is building a modern data platform based on a data lake architecture. I’ll talk about some of the architectural patterns for building a data lake and some of the requirements as you think about your next-generation modern data architecture, and then we’ll talk specifically about how to think about data management, data governance, and things like that. All right, so I’m representing Zaloni; we are a data lake software company. We bring two key products to the market: Bedrock, which is a foundational layer for data management and data governance, and Mica, which is a layer on top of Bedrock focused on self-service capabilities for the business user.
All right, so before we dig into the data lake or building a modern data architecture, let’s talk about why we are seeing the emergence of this new paradigm. As we talk to customers in different verticals, one of the key things they are trying to build is an agile data platform, because traditional architectures are not well suited for all the different use cases they have, especially the emerging ones. So how do you create an agile data platform? By using a modern data lake architecture, so that you have a scaled-out storage and compute layer. At the end of the day, you’re trying to get insights quickly out of the data that is coming into the platform.
But let’s define what a data lake is and what some of its requirements are. The promise of the data lake is a single unified repository where you can ingest data from a variety of sources. They could be enterprise data stores, where data may be coming from operational data platforms or your enterprise data warehouses, or they could be external data sources. We’re seeing a lot of use cases where customers are bringing in data from sensors and machines that are deployed in products in their customers’ environments, so that they can provide a better experience to their end users. The single unified repository is not just meant for dumping data in its raw form. It is also used for creating standardized data models or standardized data sets that you can use across the organization in a trusted manner, and for creating refined data sets for specific use cases that you can then provide for downstream access.
And then the other requirement that we see is that it is a platform where you want to bring a variety of workloads to run on the scale-out environment. Those workloads could be your typical reporting and analytics use cases, but also, more and more, machine learning and similar applications that want to iterate over data multiple times and do it in a cost-effective manner, on a scale-out storage platform using parallel compute architectures. So think about this single unified repository now being able to capture data for a long period of time, because you don’t have to get rid of data the way you used to in a typical data warehouse environment. And now you’re able to run your algorithms on large data sets using parallel processing.
The other pattern that we see is that these architectures are becoming more and more converged. What do I mean by that? Typically, when data lake and Hadoop-type technologies started, it was mainly for batch workloads. Now we are seeing more and more streaming use cases coming into the same environment, so you need to think from a converged standpoint about how you marry both in-memory and batch workloads in the same environment, leveraging the same infrastructure. Ultimately the goal is to provide shorter time to insight for your business use cases, so that you can leverage these data lake platforms to create an agile data environment.
All right. So, this is a changing pattern that we see from a traditional data architecture to a modern data architecture. In the traditional data platform you have the source systems going through an ETL process and loading into a data warehouse, and then you’re creating the canonical data models and the data marts on the DW platform for your downstream access, whether that is BI tools or applications. What we’re seeing in the new modern data architecture is that you first bring all your data sources into the data lake environment. We’ll talk about the different zones in the data lake in a minute, but you keep the raw data in the data lake for an extended period of time, and then you’re able to create derived data sets for different use cases. You’re also able to provide sandbox areas for ad hoc exploratory analytics. So you need to think along those lines: what are my requirements to provide this multitude of different access patterns in the data lake itself, and what are some of the things I need to put in place from a management and governance standpoint?
But you’ll also see that the EDW still exists, because this is more of an augmentation. As you start building the data lake, you can do a lot of the heavy lifting in the data lake itself and feed the results into the data warehouse, so that you are still supporting your legacy BI applications out of the data warehouse, while the new greenfield applications that are emerging can run in a converged manner in the data lake itself. That’s the pattern we see over and over again: because of cost, storage, and other requirements, you are now able to shrink the footprint of the DW so that it supports the existing use cases, while the new use cases start their deployment in the data lake itself. All right.
So, having said that, we have gone through many customer environments and come up with what we call the Big Data maturity model to build a modern data architecture.
So I wanted to run this by you. This is something that we have recently validated with analysts like Gartner and others, to see if we are right about how customers are thinking about the data lake. In terms of the maturity model, the first phase we see is ignore: you don’t even have a data lake platform right now. You’re mainly based on a traditional data store with a data warehouse. I think this is where the majority of the market is right now, but they’re looking at how to transition from just having a data warehouse type of environment into something that they can build in a scalable manner.
So, the next stage is what we call store. This is where customers are building scale-out platforms, whether they’re Hadoop or Amazon and others, but for very specific use cases, one or two use cases. There isn’t a whole lot of thought put into modern data architecture, management, and governance. It is: bring all the data in for those one or two use cases and make those use cases work. That works great for the ad hoc exploratory data science type of projects that you’re doing. But then that takes hold, and now the organization needs to support a multitude of different use cases. That is when we see them migrate to what we call the manage phase. Now you no longer want to have a data swamp, which is what you get when the data is not managed in the store phase. You actually want to have proper data management, data quality, data governance, and those features, so that you can scale it out and provide the platform not just for those one or two use cases but for many different lines of business. You can create a shared services model for providing this data lake to those different lines of business.
Then, going forward, we see that there is an automate phase. Now you have a management platform in place, a managed data lake environment in place. So how do you automate these data pipelines, so that you are able to bring in data from many different sources in an automated way and reduce the time to insight?
And then the last phase, which I think is more aspirational right now, is the optimize phase. This is where you’re using machine learning and other algorithms to actually help you with the data management tasks. How do you provide these capabilities so that you’re using predictive models to detect faults in data, for example, so you’re able to tell your line of business that the data that came in didn’t look right, based on the data that you have been ingesting over a period of time? You’re also trying to do things like find duplicates in the data. One common use case we see is entity resolution and deduplication of the data: how do you do that with probabilistic algorithms that you apply in the data lake itself? That is where we see the maturity come in and provide you those capabilities.
On the question of data quality checking in the earlier phases: you may be doing some data validation, but it may not be done at scale, for example. So it comes with the maturity, and some of these things can move one way or the other, but that’s how we see this being structured. Also, the numbers below, showing where the market is, are something that we have been trying to validate with various analysts, and we have gotten positive feedback. The majority of the market is in the ignore phase; then folks are moving into the store phase, which is creating a data swamp, basically just dumping the data; then slowly they’re moving into the manage phase; and very few are in the automate phase. All right. And this is the value curve, if you will, in terms of where you’re at and the value that you realize from the platform.
So having said that, let’s take a look at what we consider a reference architecture for the data lake. By the way, all these slides are posted on the Strata website, so if you go to the link for this session you should see a link to SlideShare where they are available. When we think about a data lake, we use a reference architecture that is flexible enough that you can change things depending on different requirements, but we think from a zone perspective. You have your source systems bringing the data into the data lake. We consider that there may be a need for a transient landing zone, and I’ll give you an example. Working with financial services customers, they want to make sure that the data is masked and tokenized before it is made consumable, so they don’t even want to make the raw data available to downstream users unless it has gone through that treatment. That’s where you may need something like a transient landing zone, where you’re landing the data temporarily before you make it available in the raw zone. When you make it available in the raw zone, the proper security treatment has been applied to the data, for example.
The raw zone becomes your largest zone in terms of the volume of data that may be in the data lake, because you may have a retention policy where you’re storing this data historically for a very long period of time. We’re working with a healthcare insurance customer, and they want to store it for the life of a person, so think about the timeline there in terms of how long the data could be there. Then we see that the raw data may need to be treated or validated and published in what we consider the trusted zone.
So the trusted zone holds the certified data sets in an organization. They may be defined by a centralized data authority, like a Chief Data Officer’s office, which has gone through some validation of this data and says that this is good, certified data that should be used for all the downstream use cases. We see that used for creating, say, a single view of customer or a single view of product, or for creating validated data sets that can be certified and trusted for the rest of the use cases. And then we consider something called the refined zone.
So in the refined zone, you’re doing correlations and aggregations and creating new data sets for specific use cases. For example, let’s say there’s a marketing analytics project, and you have to find the lifetime value of your customers. You may be doing correlations across multiple data sets and creating a new derived data set that you’re putting in the refined zone. But you’re doing this in a way that is reusable, so that if somebody else needs this data set, they have it available in the refined zone. And then the last area that we talk about is the sandbox. This is a space that is less governed, where you can bring in your own data and do some ad hoc exploratory use cases, and then you can throw it away if you don’t want to operationalize these things. This is the area where you would play with various data sets that you could source from the different zones: some from the raw zone, some from the trusted zone, and some from the refined zone. You can build your models and things like that there, so that you can then operationalize these data sets.
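The zone layout described above can be pictured as a small model. The following is only an illustrative sketch in Python, not Zaloni’s implementation: the `Zone` enum, the `ALLOWED_MOVES` table, and `can_move` are hypothetical names, and the exact set of permitted movements is an assumption inferred from the talk (transient to raw, raw to trusted, trusted to refined, and any governed zone feeding the sandbox).

```python
from enum import Enum


class Zone(Enum):
    """Zones of the reference architecture named in the talk."""
    TRANSIENT = "transient"
    RAW = "raw"
    TRUSTED = "trusted"
    REFINED = "refined"
    SANDBOX = "sandbox"


# Assumed movement rules (hypothetical): data lands in the transient zone,
# is published to raw, validated into trusted, and derived into refined.
# The sandbox may source from raw, trusted, or refined.
ALLOWED_MOVES = {
    Zone.TRANSIENT: {Zone.RAW},
    Zone.RAW: {Zone.TRUSTED, Zone.SANDBOX},
    Zone.TRUSTED: {Zone.REFINED, Zone.SANDBOX},
    Zone.REFINED: {Zone.SANDBOX},
    Zone.SANDBOX: set(),  # operationalizing sandbox work is not modeled here
}


def can_move(source: Zone, target: Zone) -> bool:
    """Return True if the assumed rules allow data to flow source -> target."""
    return target in ALLOWED_MOVES[source]
```

Keeping the rules in one table makes it easy to adapt them, which matters because the talk presents the reference architecture as something to adjust to each organization’s requirements.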
So, having said that, let’s take a look at how you should think about the holistic management of a data lake as part of a modern data architecture. This is our view of the different features that are needed for managing a data lake. That starts with enabling the data lake, which is where you have a managed ingestion process to bring in the data from various source systems. It could be from mainframes, relational databases, files that are being dumped, REST APIs that are getting data from various websites, and also streaming data sets. You need to think about it holistically, because the data lake is a converged modern data architecture: how does streaming ingestion work in this environment? The other aspect of enabling the data lake is being able to do auto-discovery of the metadata, so that you can tag the data and capture operational metadata, along with the business and technical metadata, as you ingest the data.
Once you know how to get the data into the data lake, we think about governing the data lake. In governing the data lake, we make sure that there is proper lineage maintained for the data as it goes through the different zones, from raw to trusted to refined. Whether you’re doing transformations and aggregations or denormalizing the data for various use cases, you’re capturing all of that data provenance so that you can show, from a lineage perspective, what is going on. For regulated industries especially, this is a mandate: you need to have traceability of how you came up with your risk models for CCAR or risk data aggregation and things like that.
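To make the lineage idea concrete, here is a minimal sketch of recording provenance as data moves between zones and then tracing a data set back to its sources. It is an assumption-laden illustration rather than any product’s API: the `LineageGraph` class, its methods, and the data set names are all hypothetical.

```python
from collections import defaultdict


class LineageGraph:
    """Records which data sets were derived from which, and by what step."""

    def __init__(self):
        # target data set -> set of (source data set, transformation step)
        self._parents = defaultdict(set)

    def record(self, source: str, target: str, step: str) -> None:
        """Note that `target` was produced from `source` by `step`."""
        self._parents[target].add((source, step))

    def upstream(self, dataset: str) -> list:
        """Return every data set that `dataset` was directly or indirectly derived from."""
        found, visited, stack = [], set(), [dataset]
        while stack:
            current = stack.pop()
            for source, _step in sorted(self._parents.get(current, ())):
                if source not in visited:
                    visited.add(source)
                    found.append(source)
                    stack.append(source)
        return found


# Hypothetical example: a risk data set built from two trusted data sets.
lineage = LineageGraph()
lineage.record("raw.trades", "trusted.trades", "validate")
lineage.record("trusted.trades", "refined.risk_exposure", "aggregate")
lineage.record("trusted.counterparties", "refined.risk_exposure", "join")

sources_of_risk = lineage.upstream("refined.risk_exposure")
```

Walking the graph from a refined data set back to its raw inputs is the kind of traceability a regulated team would need to show how a figure was produced.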
The other aspect that’s often ignored in the big data space is data quality. Data quality is not thought about as a core functional area, and I think this is one of the key gaps that we see, because if you don’t have proper quality of the data, it’s garbage in, garbage out. So being able to have either deterministic or probabilistic approaches to checking the quality of the data is important, and you need to keep that in mind. The other aspect we think about quite a bit is the privacy and security of the data: being able to have data secure at rest and in motion, as well as specific fields of the data set that may need to be masked and tokenized as they are stored within the data lake. And last but not least in the governance phase is data lifecycle management. As these environments grow, they can be substantially large data platforms, so you need to think about a policy-based approach to how you retain data. If you have different zones within the data lake that are hot, warm, and cold for different access latencies, you need a policy-based way of specifying, at the data set level, how long data stays in each zone, so that it can move from one zone to the other based on an automated process.
Once we have figured out the governance aspects, we talk about how you engage with your business users. Ultimately it is about the business use case; ultimately it’s about reducing the time to insight. That’s where we think about providing a rich catalog, based on all the metadata that you have already harvested in the data lake, and also providing some self-service functions on the data lake. That may include self-service data preparation tasks or self-service data provisioning, where you’re creating a sandbox and things like that.
So that’s what we focus on with Bedrock and Mica. Bedrock focuses on enabling and governing the data lake, and Mica focuses on engaging with the business. Mica is mainly geared towards business users, whereas Bedrock is more for technical users.
All right, so let’s talk a little bit about data governance. This whole world is changing in terms of how data governance and metadata management are implemented in these scale-out platforms. One of the key things about governance is that you need a really solid foundation with metadata management; a common metadata layer is critical to make sure that you have a solid governance story. At the same time, it needs to factor in that this is a distributed platform. You may have multiple types of data stores, not just a relational structure. And it needs to be lightweight, because customers are tired of multi-year, multi-million-dollar projects to put a governance platform in place. It needs to fit in the agile model. The other aspect is that it needs to be both a top-down approach, where a central data authority can specify and certify data sets, and a bottom-up approach, where you can crowdsource some of this information from your data consumers.
This is a model that I borrow from Gartner; at the Gartner Data and Analytics Summit last week they talked quite a bit about it. They think about concentric circles, where at the core you have centrally governed critical data elements. Beyond that, you can have reasonably governed data sets, which may be owned by lines of business; they are departmental data sets, let’s say. And then the outer circle is locally governed data sets, which are used for specific applications and are loosely coupled with the rest of the governance model.
So keep this in mind as you think about how you implement a data governance structure for your data lake environments. All right, so metadata management is key to having a governed platform. When you think about metadata management, it’s not just the technical metadata. You also need to think about the business and operational metadata. What are the standard terms that I can bring in from my business glossary? What are some of the operational metrics that I’m capturing on the data that I’m bringing in? Ultimately you’re trying to reduce the time to insight or analytics, and to do that you need to be able to capture different types of metadata, like the things that I have listed here.
One of the key requirements that we’re seeing, especially in large enterprise environments, is that this data lake does not exist in a silo. It is part of a broader enterprise data landscape, and the data lake needs to be a good citizen in that landscape. So one of the key requirements is to be able to exchange metadata. You may have an enterprise-wide metadata repository where you’re collecting metadata from various data platforms, and the data lake needs to participate in that. The way we think about it is that the data lake needs something that plugs into your enterprise-wide metadata repository, and we use a metadata exchange framework to do that. If you have things like Collibra or IBM IGC, where you’re defining the standard terms and the glossary and maintaining metadata, being able to get those definitions into the data lake, as well as being able to feed into those repositories, is important. All right.
So going back to the slide, managed ingestion is a key part of how you think about bringing in the data.
So you need to think about a scale-out way of doing ingestion. Self-service ingestion is great, and it’s needed because you want the consumers to be able to perform some of these ingestion functions, but you also need to think about how you scale it out. You have source systems that may have, say, 10,000 tables. How do you bring 10,000 tables into the data lake in one shot, for example? You don’t want the user to be pointing and clicking around and creating 10,000 definitions to bring in these tables. And it’s not just from tables or files, it’s also from streaming. Do you integrate with Kafka or Flume or something like that? How do you get the data from message queues into the data lake platform? As you get the data, you need to make sure that there is strong metadata integration with it, so that your common metadata layer can be populated based on the data that is coming in.
Lineage is another critical area. Lineage captures how data moves through the data lake platform. Again, this is something that should be shared with the enterprise-wide metadata repository; for various use cases you want to have that view. Let’s say your source system was an Oracle database that fed the data into the data lake, it went through three different steps, and then a report was generated and published to Qlik or Tableau. You want to maintain that lineage and be able to share it with your metadata repository, so that you can also do impact analysis if needed. If you’re changing some processes, you can see what things can get impacted, provided you capture the lineage information properly.
Data quality is another aspect. This is key in a lot of different use cases, and you need to think about how you implement it. We have started down the path with a rules-based engine, so that you can specify DQ rules.
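One way to picture a rules-based approach is to keep each rule as data and apply the whole list to incoming records. The sketch below is a simplified illustration under stated assumptions, not the engine the speaker refers to: `FieldRule`, `RULES`, `split_records`, and the two example rules (a basic e-mail pattern and an age range) are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class FieldRule:
    """A single data quality rule applied to one column of a record."""
    column: str
    description: str
    check: Callable[[object], bool]


# Hypothetical rules: one format check and one range check.
RULES = [
    FieldRule(
        "email",
        "matches a basic e-mail pattern",
        lambda v: isinstance(v, str)
        and re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v) is not None,
    ),
    FieldRule(
        "age",
        "is an integer between 0 and 120",
        lambda v: isinstance(v, int) and 0 <= v <= 120,
    ),
]


def split_records(records, rules):
    """Separate records that pass every rule from those that fail any rule.

    Failing records are returned with the descriptions of the rules they
    violated, so they can be reported back to the data producer.
    """
    good, bad = [], []
    for record in records:
        failures = [r.description for r in rules if not r.check(record.get(r.column))]
        if failures:
            bad.append((record, failures))
        else:
            good.append(record)
    return good, bad
```

Because the rules are plain data, new checks can be added without touching the pipeline code, and the list of failures per record gives a starting point for remediation and notifications.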
And that rules engine is integrated in a managed data pipeline. You can see that we are doing entity-level checks, so that you can check for duplicate files, whether the whole file came in, whether it matches the schema, and all that. Then you may want to do field-level checks. Does the field have a valid format? Is it within a specific range? Things like that, so that you are able to separate good records from bad records and have a remediation process for the bad records in the data lake itself. You can tell the data producers that you got so many bad records that violated an SLA, for example, and go through the process with the data producer. It needs to be automated, with notifications and things like that. What we’re looking at more and more is machine-learning-based classification of the data, so that you can automatically detect bad data and separate it out based on predictive models.
All right, security and privacy are a key aspect of building a modern data architecture. You need to make sure that your infrastructure is secure for both data at rest and data in motion. If you need to have encrypted volumes for data at rest, you need to make that part of your platform requirements as you build this out. We also see role-based access control for both the metadata and the data. You may have a bunch of users, let’s say on the marketing team; they should only have access to the artifacts and the data sets for the marketing department. Another set of users on the finance team have access to the data sets from the finance side of things. So you need a model where you can define role-based access control on both the data and the metadata, and then you need to be able to mask and tokenize data for various use cases, which we see come up very often now with EU GDPR approaching.
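As a rough illustration of masking and tokenizing specific fields, the sketch below drives both from a small field-level policy. Everything here is an assumption made for the example: the `SENSITIVE_FIELDS` policy, the function names, and the choice of a keyed hash (HMAC-SHA256) as the token. A keyed hash gives a consistent, non-reversible token; production tokenization often uses a vault or format-preserving scheme instead.

```python
import hashlib
import hmac

# Hypothetical field-level policy, as might be held in a metadata layer.
SENSITIVE_FIELDS = {"ssn": "tokenize", "email": "mask"}


def tokenize(value: str, key: bytes) -> str:
    """Replace a value with a keyed, one-way token (HMAC-SHA256 hex digest)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def mask(value: str, visible: int = 4) -> str:
    """Hide all but the last `visible` characters of a value."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


def protect(record: dict, policy: dict, key: bytes) -> dict:
    """Return a copy of `record` with the policy applied field by field."""
    protected = {}
    for name, value in record.items():
        action = policy.get(name)
        if action == "tokenize":
            protected[name] = tokenize(str(value), key)
        elif action == "mask":
            protected[name] = mask(str(value))
        else:
            protected[name] = value  # not tagged as sensitive; pass through
    return protected
```

Since the same key always yields the same token for the same input, tokenized fields can still be used for joins and deduplication without exposing the underlying value.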
So you have to be able to protect sensitive data. How do you do that at the metadata level, so that the data that is made consumable in the data lake has gone through the various checks? And then you need to be able to track all of this, so that you can provide audit and access logs and raise alerts and notifications.
All right, so data lifecycle management is another aspect. I talked about this briefly: being able to have different zones in the data lake that are hot, warm, and cold. This is our view of how we provide that in a Hadoop-based platform, or with an object store so that you can archive into the object store. You can specify it at a data set level. Let’s say this is my customer data. I want to keep it in my hot zone for 30 days, and after that I want to move it to the warm zone. I want to keep it in the warm zone for 90 days, and after that I want to move it to my cold zone, which may be based on an object store like S3. So you need to think along those lines in terms of how you manage the lifecycle of the data across these different storage tiers.
Alright, so now going to the business users. Now you have the data lake (modern data architecture) in place; how do you engage your business users? This is where you need to think about a rich catalog, so that the users can find the data easily. It’s not just showing them the list of different data sets. They also need to see the KPIs about the data: how good is my data, what is its quality, how many good records versus bad records, how fresh is it, and things like that. What’s the lineage of this data, and how did it get created? Those functions should be integral to a catalog. And there is one more thing to think about with a catalog.
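The hot, warm, and cold example above (30 days hot, then 90 days warm, then cold) can be written down as a small data-set-level policy. This is a minimal sketch under stated assumptions: `LifecyclePolicy` and `tier_for` are hypothetical names, and the boundary handling (a data set moves to the warm tier once it is 30 days old) is a choice made for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class LifecyclePolicy:
    """Retention per storage tier for one data set, in days."""
    dataset: str
    hot_days: int
    warm_days: int


def tier_for(policy: LifecyclePolicy, ingested_on: date, today: date) -> str:
    """Return the tier a data set should be in, based on its age in days."""
    age = (today - ingested_on).days
    if age < policy.hot_days:
        return "hot"
    if age < policy.hot_days + policy.warm_days:
        return "warm"
    return "cold"


# The customer data example from the talk: 30 days hot, then 90 days warm.
customer_policy = LifecyclePolicy("customer_data", hot_days=30, warm_days=90)
```

An automated job could evaluate `tier_for` for each data set on a schedule and move anything whose tier has changed, which is the policy-based movement the talk describes.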
These data environments are growing; customers are building different platforms in different environments that could be cloud-based or on-prem. So you need to think about a catalog in a much more unified manner. It needs to be more logical than physical, so that you are not just showing users the data sets from one physical environment; you’re able to correlate data sets from multiple environments and present one unified view. You’re almost creating a data portal or a data marketplace in this example.
Alright. So once you have the catalog, you need to enable some of the self-service functions. Self-service data preparation, where the business users can do these functions without having to go to IT, is an important aspect of it. At the same time, you need to make sure that whatever the business users are doing is not just throwaway work. You need to be able to operationalize it in the core data management platform without much effort. That’s an approach you should always think about as you enable these self-service functions. The other aspect we see is being able to source various data sets in a shopping cart experience and create a sandbox, whether it is in the data lake environment (modern data architecture) or in an MPP or relational data store, so that you can do your ad hoc and exploratory analytics.
All right. So cloud is an important consideration. Most of our customers are either building a hybrid cloud strategy, where they have on-prem and are onboarding cloud infrastructure, or starting cloud-first. So you need to think about the considerations for the stack and how you deploy data environments, and about being able to leverage some of the cloud features. You don’t want to just do a lift and shift of your existing environment to the cloud.
How do you leverage elasticity? How do you leverage scale-out storage, a cheaper storage platform than, say, storing data in a file system that is much more expensive? You need to think about that in terms of not just a single type of storage but maybe multiple types. You may have an object store, you may have an MPP database, you may have a distributed file system; how do you leverage all of that as you move to cloud? And then how do you maintain security? Security comes up very often, because enterprises that have built these data platforms on-prem have integrated them with their identity management and access control systems. Now, as you bring in cloud, how do you integrate with your existing AD or leverage other cloud-based identity management platforms? Those are the key considerations as you think about building the data lake.
We have a new book on how to deliver the managed data lake. It’s available in our booth as well as on our website, if you are looking for it. It’s developed by ESG group, Nick Rohde and others, so feel free to download that.