Data Governance Framework for DataOps Success

November 10th, 2020

Read the webinar transcript here:

[Ben Sharma] All right everyone, welcome to this webinar on a data governance framework for DataOps success. I’m very excited to talk about this topic. Thanks for joining, and thanks for your time today. We will be recording this webinar, and it will be available on BrightTALK once the webinar is over. I’m Ben Sharma, the founder and Chief Product Officer of Zaloni. Here at Zaloni, with our Arena platform, we help our customers manage the sprawl of data that is occurring in their enterprises. Data is being generated everywhere: it’s in internal sources, the operational data stores; it’s coming from external sources and third-party systems; it’s on-prem and in the cloud; it’s batch data as well as streaming data. So we see a lot of different types of data being generated, and enterprises are having to deal with this plethora of different sources while managing the data in a governed, secure manner, while making it available for self-service access, while making sure that sensitive attributes are not being shared with unprivileged users, and while making sure that the enterprise data governance policies are being followed. But at the end of the day, what this is all about is: how do you enable new business insights, how do you enable new products and services, how do you keep your customers happy by driving agility in your analytics initiatives, not just with governance, but making sure that you’re geared towards delivering business value? When we work with large enterprise customers, some of whose logos you can see here, they have complex data environments. These environments need to have a foundation in place so that they can deploy not just one or two use cases but a whole backlog of use cases that can be pipelined and deployed in a reusable way, with the proper governance, the proper security controls, and the proper quality in place, so that ultimately the data consumers are enabled for self-service access.
We’re very fortunate to have key partners like AWS, Azure, MongoDB, and others that we work with in terms of delivering the end-to-end data platform as our customers are moving to the cloud from on-premises products. And then we have partners who deliver our technology along with the others. We have also been very fortunate in terms of industry validation, with awards from CIO magazine, a banking technology award, and others that we have won over the last 12 months, validating the approach that we are taking with our DataOps platform. So before we move to the core of today’s presentation, let’s set some expectations in terms of the challenges that exist today in the cloud data world, as customers are moving their data platforms to the cloud. What we see is that, in addition to the on-prem data stores, typically relational databases or files, that need to be managed, there is quite a bit of complexity that has been created with external data stores. These may be coming from third-party systems, or from applications that are part of a digital transformation and are generating new types of data that were not available before. These are new use cases where you are now signing up customers who are using a mobile app on their phone instead of visiting a retail banking branch, or who are applying for a loan on their phone instead of going and talking to somebody in a retail location of a bank, for example. So there are a lot of new data sources, a lot of new data stores that need to be handled, and different types of data. At the same time, what we see is that a lot of these use cases are being implemented by stitching things together with scripts and other tools and technologies. The result is inefficient data pipelines that cannot be reused or deployed across a different set of use cases, so they’re very specific, and they’re fragile and not very maintainable as you scale out all these platforms.
The other key area of challenge that we see is inconsistent governance. Enterprises have well-defined policies and procedures in terms of how governance of the data needs to be managed. But as you think about these new technology platforms, whether it is on-prem and cloud combined in a hybrid architecture, or a move to a cloud-based data warehouse, there isn’t a way to standardize the governance model if you do not have a solid foundation in place. So what we see is that, without proper metadata management and without proper data quality, a lot of these data initiatives do not go very far because of the challenges related to governance and compliance in large, complex enterprise environments. And then last but not least, no matter what you do in your data platform, if you are not able to enable your data consumers to come and get access to the data by themselves in a self-service manner, then too much time is needed to make data available for various downstream use cases. So having self-service access to the data with the proper data governance is one of the key challenges that we see in these organizations.

(6:00: DataOps Maturity Model)

So before we move on, let’s talk about where we’re at, and how we think about this as a journey rather than a single jump from not having a DataOps approach to having a full-fledged DataOps approach. At Zaloni, we try to evaluate various initiatives that our customers are doing, and then map them onto a standardized maturity model like this one that we have developed over the years, where there are multiple stages. So I’ll walk you through what this means from the standpoint of thinking about data governance in the context of DataOps, and also define a little bit of what we think DataOps is, what the core of DataOps is, as you try to build this out in your organization. So, going from the left-hand side to the right-hand side, think about your first stage. This is where it is unmanaged, so you’re basically serving your use cases but without a proper foundation, without proper management of your data. You’re doing this on a very ad hoc basis. As you get requests from line-of-business teams or different departments, you’re putting together an initiative, you’re fetching the data from the source system, and you’re bringing it into your data platform, whether it is on-prem or cloud. There is no concept of metadata management, there is no concept of reusability of data pipelines; you’re just making the data available for the downstream use case for the business unit and moving on. And you do that over and over again as you get new requests. What that does is create ad hoc, siloed data stores without a proper foundation, and it takes too long for your data consumers to get access to that data and use it for their analytics. A lot of times, deadlines pass because the time to insight is too long, and the data is not clean or trusted enough to be used for making these critical business decisions.
So from that stage, you go to what we call the managed stage, which is the second stage of the maturity model. This is where you use some programmatic way to go get the data from the source systems, where you’re doing data discovery across your various enterprise data assets, and you’re then mapping it to the data platform where you need to put this data. You’re mostly dealing with technical metadata at this point. This is better than the first stage, because here you have improved visibility into what data exists in your organization. And there is an opportunity to reduce the duplication of data. So if, for example, there is a source system which manages customer data in a CRM, and line of business one comes to access that data and you fetch it, then when another line of business wants to access the same data, you know that this data already exists, which you may not know in the first stage. So you can reduce the duplication of the data and make the data available for this other line of business, so that they can support their use case. Now, going on from this managed stage, the third stage is what we call the operationalized stage.
This is where you’re now thinking about creating reusable data pipelines. In stage two you created data pipelines, but now you’re making them reusable, so that it’s not just for one source system but for multiple source systems that you’re able to reuse them. You are also bringing in a DataOps mindset where you’re automating these data pipelines, so that various things can be done programmatically and happen based on an event, bringing in some of the concepts of CI/CD from a DevOps standpoint and making them applicable to your data pipelines. One good example is being able to run a battery of test cases as you bring in a new data set, to validate the quality, the governance, and various other aspects of the data set. If that battery of test cases passes, then you’re promoting certain things from your lower environments to a production environment, for example, in an automated way, so your time to deploy is much shorter than it was in the previous stage. So this is where the business value is: efficiency gains in the data supply chain are obtained from your data pipelines. And then you’re also reducing the duplication of effort. In the previous stage you were reducing the duplication of data, but now that you’re creating these reusable pipelines, you’re able to reduce the duplication of effort in addition to the duplication of data, because you can reuse these pipelines over and over again. The next stage is what we consider the governed stage. Now that you have operationalized these data pipelines, you also need to think about various aspects from a data governance standpoint, so that you can make trusted data available.
So, for example: being able to check for data quality in an automated way, being able to do profiling and classification of the data, being able to mask and tokenize sensitive attributes, making data available based on a role-based access control model, and making sure that self-service is enabled. All of those aspects that we consider part of a data governance model need to be enabled on top of your DataOps approach, so that now you’re not just making the data supply chain more efficient, but you’re doing it with the enterprise data policies and the enterprise compliance policies in place. So the outcome of this stage is that you are now able to access trusted data with standardized governance and self-service access across the organization, so that you are really reducing the time to insight and adding value for various business use cases. And the last and final stage, from a maturity model standpoint, is where you’re now using advanced analytics technologies like machine learning to make data management better. In this case, you are enabling these functions as part of the DataOps pipeline so that you can check for anomalous data, and you can make remediation smarter, so that bad data that is detected after a data quality check can be automatically routed back to your data producers for remediation, for example. You can do things like automatic data classification, so that sensitive attributes or other types of attributes can be categorized, and then you can automatically take some actions like masking and tokenizing these sensitive attributes. Or, if a certain set of attributes should only be available to a certain class of users, you’re able to assign role-based access control policies on those data sets and make them available in that way. And then also, you’re able to gather a lot of insights about your data.
So this is where you can now make recommendations to your end users on which data sets may be related to the data sets that they’re looking at, things like that, so that you can have a more frictionless and timely delivery of trusted data using these AI and ML techniques from a data management standpoint. So that is the framework that we think about as a DataOps maturity model, and as I mentioned earlier, this is a journey. You shouldn’t expect to come to stage four without going through some of these earlier stages, operationalizing, making sure things are automated and reusable, and then putting the governance functions in place. A lot of times these go hand in hand. You need to think about metadata as the basis for automation, and make sure those are some of the key attributes that you think about as part of your pipeline.
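
The stage-three idea described above (run a battery of test cases on an incoming data set, and promote it to production only if every test passes) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Arena’s implementation: the check functions, the `customer_id` field, and the `run_gate` helper are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable

Record = dict[str, object]
Check = Callable[[list[Record]], bool]


def no_missing_ids(rows: list[Record]) -> bool:
    """Every record must carry a non-empty customer_id (hypothetical field)."""
    return all(row.get("customer_id") for row in rows)


def no_duplicate_ids(rows: list[Record]) -> bool:
    """customer_id values must be unique within the batch."""
    ids = [row.get("customer_id") for row in rows]
    return len(ids) == len(set(ids))


@dataclass
class GateResult:
    promoted: bool
    failed_checks: list[str]


def run_gate(rows: list[Record], checks: dict[str, Check]) -> GateResult:
    """Run every check; the batch is promoted only if none of them fail."""
    failed = [name for name, check in checks.items() if not check(rows)]
    return GateResult(promoted=not failed, failed_checks=failed)


# The "battery of test cases" for this illustrative data set.
CHECKS: dict[str, Check] = {
    "no_missing_ids": no_missing_ids,
    "no_duplicate_ids": no_duplicate_ids,
}

good_batch = [{"customer_id": "c1"}, {"customer_id": "c2"}]
bad_batch = [{"customer_id": "c1"}, {"customer_id": "c1"}, {"customer_id": ""}]

print(run_gate(good_batch, CHECKS))  # all checks pass, so promoted is True
print(run_gate(bad_batch, CHECKS))   # both checks fail, so promoted is False
```

In a real pipeline the gate would typically be triggered by an event such as a new file landing, and the list of failed checks would be recorded as operational metadata rather than printed.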

(15:00: Arena DataOps Cycle)

Moving on, we think about DataOps as a cycle, as a continuum. So we use this loop to talk about how we see collaboration between the two key constituents: the data engineers and the data stewards on one side, and the data consumers, which are data analysts and data scientists, on the other side. Both of these key classes of personas need to be served through a collaborative data catalog, so that’s the center, if you will, of these two constituents collaborating and making sure they are adding value for each other. But as you think about the various functions along the data management continuum, along the data supply chain, all of those things contribute to the catalog. So think about being able to capture lineage, which is something that the data engineer will put in place to support the requirements of the data stewards, or being able to master data across multiple data sources, or being able to check for data quality and identify bad data sets, or bad data within a data set versus good data within a data set, and separate them out. All of those things need to be served through the catalog. Same thing with profiling and classifying, so that you can see the shape of the data, you can see the distribution and some other statistical measures, and you can see the classification of sensitive attributes. All of those things need to be captured as metadata and served up through the catalog. On the right-hand side, as data analysts and data scientists come and explore the data through the catalog, they may be looking at a rich marketplace experience where they can add data to a shopping cart and provision a sandbox.

They can create an analytical workspace so that they can quickly validate their algorithms, and so on and so forth. They can also contribute to the data governance process. So think about data governance not just as top-down, but as a hybrid approach that is both top-down and bottom-up. Top-down means the centralized data authority, like data stewards in a Chief Data Officer’s office, defining the data governance policies and enforcing them for some of the key data sets across the organization, the crown jewels. But at the same time, when a line of business is dealing with data sets that are very relevant to them, that they’re bringing in, and that the CDO’s office is not aware of, they should be able to provide governance as subject matter experts in the governance process itself. So being able to annotate and tag data, add additional metadata, and correlate that data set with other data sets that are coming from the trusted sources: those are some other things that we see. And again, all of those things can be served up through the catalog so that other members of the organization can also use the work that one line of business has done, for example. So that is one of the key approaches: think of the catalog as front and center, and then have all these data management functions contribute metadata throughout the process, so that the catalog is enriched all the time and more and more value can be provided to your constituents, whether they’re data engineers and data stewards on one side or data consumers on the other side. We also talk about what we consider a governance model based on zones. We call it Zaloni’s zone-based governance.
Again, this is not a fixed governance model with a fixed number of zones. It will depend on your organization, your requirements, and your policies. You can define zones that can be operationalized as part of your DataOps pipeline, which our software platform provides, and then you can define the characteristics of each of these zones. So, for example, in this use case there are four zones being created. There is a raw zone, where data from source systems comes in and resides in its raw format. Then you do various checks and balances on that data, make sure data quality is checked and masking and tokenization are done, and you create data sets and populate them in the trusted zone, so that these are certified data sets that can be used by the broader organization. And then, as you think about data transformed for a given use case, or new insights that need to be generated for a use case, those can be created and populated in the refined zone for the rest of the organization to use. And then last but not least, you need to think about an experimental space where you can do ad hoc use cases related to your analytics experiments, and that we consider the sandbox. So this is just an example, and for each of the zones you can define who can do what: who can read data, who can write data, and which personas are able to access the data and add it to a shopping cart for downstream access, and so on and so forth. So let’s switch gears for a moment. Given that you now understand what a DataOps maturity model is in relation to data governance, and what a zone-based governance model could be, let’s talk about what some of the key challenges are that you may face as you think about moving from an on-prem to a cloud platform, and how to address them. So this slide is trying to capture the top four challenges that we see when our customers are moving from an on-premises to a cloud-based environment.
So first of all, there’s a lot of innovation going on in cloud services, especially in the data space. We have AWS re:Invent going on this week and all the announcements that are coming up. And obviously there is similar innovation, with new services being provided by Azure, by Google, and others as well. However, what we see is that a lot of these services are individualized services, so there’s nothing that puts everything together from a data management and data governance standpoint. So you may have a storage layer; you may have many compute layers, one being MapReduce-based like EMR or SDN; another is a query layer like Athena, or a warehousing layer like Redshift or Snowflake and others. And then you have various machine-learning-based services. What we don’t see is a layer of abstraction so that all of these services can talk to each other from a data management standpoint, and so that you’re able to enforce some of the enterprise data policies right away without having to stitch the services together yourself. So that’s one of the key areas of challenge. The other key area of challenge we see is governance. It is oftentimes limited to technical metadata, whereas you want various other capabilities along with that: you want to make sure that you’re able to check for data quality, capture lineage, and drive role-based access control not at the object level, which you can do with your IAM roles and things like that, but, let’s say, at the business unit level, doing things at a much more abstract level than what the individual services provide, so that you can have end-to-end visibility in terms of risk, compliance, and all the other requirements. The other key gap we see is in terms of metadata. A lot of the cloud services are focused on technical metadata, which is great, and you need that for all the other services to work.
But oftentimes there is a gap in terms of the business metadata and, more importantly, the operational metadata. So being able to tie metadata into a glossary, and being able to capture additional business metadata so that it can be made easily available to the data consumers in the infinity loop slide that we talked about, along with various other mechanisms, is important as you think about metadata. The other key area where we see a gap is that a lot of these services are focused on the developer. You have to have programming skills to achieve the objectives that you need to achieve. For example, most of these services have a Python API, so you’d have to know how to code in Python to be able to use them. You also need to think about how you build governance into the DataOps pipeline. The things that a data steward or somebody from the Chief Data Office may define may not be easily translatable to a service that has a specific set of APIs and SDKs that you need to go through. So what we see is that there is a gap for the data stewards and the data governance folks, whereas the data engineers, the technical folks, the programmers, are able to build the data pipelines and do the stitching together. Think about the maturity model and the stages that we talked about. Stage one and stage two are oftentimes what’s being instantiated. But as you then think about stage three, stage four, and beyond, those are the areas where there are challenges. So before we move on, I’d like to take a pause and let you know that we would like to poll you about your maturity model and where you are. Please take a moment to answer the poll, so that we can capture some information and you can see how other participants are doing, and then we’ll move on to the next slide.
All right, so what is our approach to solving these challenges? That is where Arena comes in. We have a unified approach to managing the data across the entire supply chain, providing the various data management functions that are needed, but doing it in the context of the cloud. I’ll talk about what that looks like in a moment. So: having end-to-end visibility, and being able to control and execute your data pipelines using cloud-native services with a single unified platform. It’s a lot of words to say, but I’ll walk you through what that means in the next couple of slides as I drill into the architecture of what that DataOps pipeline looks like. At the center of it, as I mentioned before, there is a catalog, and the catalog is not just the technical metadata, which services like Glue or Azure Data Catalog provide. What we’re doing is augmenting it with additional metadata, specifically the business metadata and the operational metadata, so that you have an end-to-end view with the data stewards and data governance folks in mind, and so that you can serve your data consumers. The other key approach we have taken is to enable various data governance functions as part of your data pipelines. The zone-based approach that we talked about is part of the feature set of the platform, so when you configure the platform to map to your underlying data stores, you can define what the different zones are. When you check for data quality, you can say: put the good data in the trusted zone.
Move the bad data into a remediation zone, from where you can automatically send it back to the data producers, for example. Capture lineage by default, annotate any attributes that are sensitive, and then apply masking and tokenization functions on those sensitive attributes, so that as you populate the data from the raw zone to the trusted zone, that data is automatically masked and tokenized for any sensitive attributes. And doing all of this in the context of a very simple, easy-to-use user experience is one of the other key things that we think about. So here is Arena’s approach. First of all, we get deployed in a hybrid, multi-cloud environment, so with one instance of Arena you can manage your on-prem data stores or on-prem data lakes, and also cloud and multi-cloud instances of these data stores and data lakes, wherever the data resides. We think about three key pillars in terms of the core functionality of Arena. The first one is catalog: being able to inventory the data wherever it exists, being able to capture active metadata, including business, technical, and operational metadata, and then being able to classify and profile the data. The second key pillar is all about governance and control, once you have an inventory of the data.
Now, how do you associate the various policies that are defined by your enterprise data office, from a governance and control standpoint, on top of the data? That is where you’re able to check for quality and separate out good data from bad data; define security rules, where you’re masking and tokenizing sensitive attributes; define role-based access control for your data based on tenancy modeling, where you may have different lines of business defined as different projects in the platform; and capture lineage as you go through the various steps of the process. Once you are able to govern and control the data, we then think about the next key pillar, which is the consume pillar. This is where you’re bringing in the data consumers so that they have the experience of a rich data marketplace. They’re able to search for and find data, and then they’re able to enrich and collaborate around the data. And if they like these data sets, they can add them to a shopping cart and provision them in a self-service manner to downstream applications, to sandboxes, or to various other downstream systems. So this is a high-level architecture of how our platform gets deployed in AWS. I’ll spend just a few seconds on this and then we’ll move on to the next slide.
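
The govern-and-control flow described above (check quality, send bad data to a remediation zone, and mask or tokenize sensitive attributes on the way from the raw zone to the trusted zone) can be sketched as follows. This is a hedged illustration rather than the platform’s actual behavior: the field names, the quality rule, and the `route` and `tokenize` helpers are hypothetical, and a salted hash stands in for a real tokenization service, which would more likely use a token vault or a format-preserving scheme.

```python
import hashlib

# Assumption: these field names were flagged as sensitive by an earlier
# classification step; they are illustrative only.
SENSITIVE_FIELDS = {"ssn", "email"}


def tokenize(value: str, salt: str = "demo-salt") -> str:
    """Replace a value with a deterministic token (salted SHA-256 prefix)."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return "tok_" + digest[:12]


def is_valid(record: dict) -> bool:
    """Toy quality rule: non-empty id and a non-negative numeric balance."""
    balance = record.get("balance")
    return (
        bool(record.get("id"))
        and isinstance(balance, (int, float))
        and balance >= 0
    )


def route(raw_zone: list[dict]) -> dict[str, list[dict]]:
    """Split raw records into a trusted zone (masked) and a remediation zone."""
    zones: dict[str, list[dict]] = {"trusted": [], "remediation": []}
    for record in raw_zone:
        if not is_valid(record):
            # Bad data is held back so it can be returned to the producer.
            zones["remediation"].append(record)
            continue
        # Good data is tokenized field by field before it becomes trusted.
        masked = {
            key: tokenize(str(value)) if key in SENSITIVE_FIELDS else value
            for key, value in record.items()
        }
        zones["trusted"].append(masked)
    return zones


raw = [
    {"id": "a1", "ssn": "123-45-6789", "balance": 10.0},
    {"id": "", "ssn": "987-65-4321", "balance": 5.0},        # missing id
    {"id": "a3", "email": "x@example.com", "balance": -1},   # negative balance
]
zones = route(raw)
print(len(zones["trusted"]), len(zones["remediation"]))
```

One simplification to note: records in the remediation zone are left unmasked here, so in practice that zone would need access controls at least as strict as those on the raw zone.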

(30:00: Solution Architecture – AWS Data Lake)

As you think about your various AWS services, whether you’re bringing in data from databases, files, or streams, Arena enables you to bring in the data and quickly populate an S3-based data lake on AWS, while at the same time providing the data management and governance functions, so that you have a single pane of glass with end-to-end visibility and control around metadata, and then provide this rich data marketplace experience to your end customers. So this is showing you, from left to right, where Arena sits with those three C’s of functionality. Data comes in and is stored in your S3 data lake, and in S3 you can have different buckets for your landing zone, raw zone, trusted zone, refined zone, and sandbox zone. Then there is integration with the Glue Data Catalog for exchanging metadata, so that the rest of your services work seamlessly. You can use things like EMR and Athena to process and query data, and serve data into Aurora or Redshift so that various applications can use it. You can then use various AI and ML services, or Python, or BI and other tools, to access the data stored in your S3 buckets through these different query engines. So this is a high-level architecture of how we think about our deployment in an AWS-type setting. And then again, there is the ability to tie into your enterprise identity management system, with integration with Active Directory and other things. So now think about where we add value on top of AWS native services. This is our approach to governance, from a DataOps perspective, on top of AWS services. On the left-hand side you have the source systems that we talked about. On the right-hand side you have the various query engines that may be used to query the data, with the data being stored in S3 buckets and used for various downstream applications.
One of the key pieces is that we make sure we integrate with the AWS data management services. This is where we use Lake Formation and all the templates and other functionality that it provides: integrating with the Glue catalog, using the Glue crawlers to find data, using the Glue ETL pipelines and their serverless architecture to run some of these data management functions, and providing the access control that is needed at a higher level, which then translates into the access control that Lake Formation provides. That way, as you come in through various query services, whether it’s Athena, Redshift, or others, you’re able to control access to the data based on tenancy for different lines of business or different business units that have access to different data sets. So that is where Arena, with the three C’s, sits on top of the AWS set of services, integrates with them, and provides this end-to-end governance model and these DataOps-centric pipelines that you can build in a reusable manner, so that you can quickly provide business insights and quickly make data sets available for your downstream consumers. So just to summarize: as you think about your data governance framework for DataOps and the approach that you’re putting in place, make sure that you think in terms of the maturity model as you go through the journey from unmanaged, to managed, to operationalized, to governed, to augmented stages. You may combine some of these stages, depending on the maturity that you have already built into your platforms. Then think about it from a unified standpoint: you want to make sure you have a view across the whole ecosystem that exists in your enterprise. It may be a complex data environment, but still you want to make sure that it is hybrid.
It should support hybrid, and it should support not just relational databases but new types of data stores, whether those are streaming sources or SaaS data stores that are available from third parties and others, all in one unified view. You think about efficiency and operationalizing these data pipelines so that you can make them reusable, and so that the same approach from a governance standpoint can be used across many different data stores and many different use cases. You think about your data governance functions as a set of foundational requirements that you need to have in place so that you can enable these use cases. And then you think about self-service: how do you enable your data consumers so that they can come in, access the data, and use it for various downstream use cases using a quick and governed approach? So that’s, at a high level, what I wanted to cover today. I see some questions have come in.

(35:00: Q&A)

Next, I’ll switch over. We have a few minutes left, so I’ll take a few questions and try to answer them. Let me go to the questions. All right, so the first question I see is: in your approach with governance for AWS, how does the compute layer work? That’s a very good question. We have taken a phased approach. It’s a journey. In our initial set of offerings, what we did was leverage EMR mainly to do the data management functions. So if you wanted to check for data quality, we would spin up an EMR cluster on demand, check for data quality, and shut it down, while we would maintain the metadata over a period of time. Now what we are enabling our customers to do is to move to Glue ETL pipelines, so that we can use the serverless architecture that it already has and run our data management functions on top of that. That way you can get the economies of scale, the elasticity, and all the things that you are moving to the cloud for, but at the same time have the governance and the data management capabilities in place. Right. The next question is: does this approach work with other service providers like Azure? Yes. The one I showed you is just a reference architecture for AWS. But if you’re on Azure and you want to have the same set of capabilities, our platform is one unified platform that’s based on a set of microservices, and we’re moving most of them to a Kubernetes deployment model. So you could deploy our set of microservices in an EKS type of environment in AWS, and you can also deploy them in an AKS environment in Azure. And if you’re deploying in Azure and you want us to manage various Azure services, we would then integrate with ADLS and various other things that are Azure-specific, like Azure Data Catalog and others. The next question is: can you apply the zone-based architecture across clouds and on-prem? Very good question, and this comes up more often than you would think.
Customers are moving to hybrid architectures. And the answer is yes. When we think about zones, we think about them as abstract concepts that then map to physical storage locations, and that mapping from a zone to physical storage locations is a one-to-many relationship. So think about your raw zone, where you could have some locations in S3, some locations in Snowflake, some locations in your on-prem Oracle, and some in your on-prem Hadoop data lake, all part of your raw zone, for example. Now, when you say that when I come to the marketplace and search the catalog I should have access to only those data sets that exist in the raw zone, then it’ll only show me data sets that reside in those locations and that I have access to. So that’s how we think about a truly single pane of glass for a hybrid architecture or a multi-cloud architecture that abstracts out the data management options.
The next question is: is there a way to automatically detect sensitive or duplicate data? There is. We are actually in the early stages right now with a beta of our offering, as part of Arena, which provides automatic data classification. What we do is use both a deterministic and an ML-based approach, a training-based approach with supervised learning, to detect patterns in data, and it will automatically annotate and classify data sets, or attributes within data sets, so that you can then associate different rules. So, for example, an incoming CRM data set may have some attributes identified as PII, and then you may have a workflow using our platform where you can say that any attributes that are marked as PII need to be masked or tokenized. And then the masking and tokenization functions could run automatically, so that the data that then gets populated from the raw zone to a trusted zone is automatically masked and tokenized.
We’re almost running out of time, so let me take the last question, which is: how do you get started in this journey to move to the cloud?
That’s a very good question. What I would say is that, first and foremost, don’t try to boil the ocean. We have seen a lot of initiatives that go nowhere, where they’re trying to just build a platform with no specific business outcomes in mind. So our recommendation always is to identify one or two use cases where you can quickly show value, and at the same time put the foundation in place. Put your DataOps pipelines in place with the governance frameworks and the governance functions that you need, based on your enterprise data policies. Demonstrate success with those one or two use cases, and then start bringing in other use cases that you can quickly show value with. As more and more data sets come into the platform, it becomes the center of gravity, where more and more use cases can now use those data sets. That actually feeds itself, and now you have a solid foundation to count on. So that’s the approach that we would recommend.
With that, I would like to thank you again for taking the time to attend this session today. And as I mentioned earlier, this will be recorded and provided in BrightTALK so that you can also refer to this later on. Thank you so much.
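Editor’s sketch: the zone idea from the Q&A (a zone as an abstract concept that maps one-to-many onto physical storage locations, with catalog search limited to data sets that sit in those locations and that the user can access) can be illustrated roughly as below. This is not Zaloni’s implementation or API. The class names, the `search_catalog` function, the location strings, and the data set names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Dataset:
    name: str
    location: str  # physical storage location (hypothetical URI-style string)


@dataclass
class Zone:
    name: str
    # One zone maps to many physical locations across clouds and on-prem.
    locations: set[str] = field(default_factory=set)


def search_catalog(catalog: list[Dataset], zone: Zone, user_grants: set[str]) -> list[str]:
    """Return names of data sets that reside in the zone AND that the user may access."""
    return [
        d.name
        for d in catalog
        if d.location in zone.locations and d.name in user_grants
    ]


# Hypothetical raw zone spanning S3, Snowflake, on-prem Oracle, and on-prem Hadoop.
RAW_ZONE = Zone(
    name="raw",
    locations={
        "s3://example-bucket/raw",
        "snowflake://EXAMPLE_DB/RAW",
        "oracle://onprem/RAW",
        "hdfs://onprem-lake/raw",
    },
)

# Hypothetical catalog entries; the last one lives outside the raw zone.
CATALOG = [
    Dataset("crm_contacts", "s3://example-bucket/raw"),
    Dataset("orders", "oracle://onprem/RAW"),
    Dataset("sales_summary", "s3://example-bucket/trusted"),
]

if __name__ == "__main__":
    # The user has grants on two data sets, only one of which is in the raw zone.
    print(search_catalog(CATALOG, RAW_ZONE, {"crm_contacts", "sales_summary"}))
```

The point of the design is that consumers query by zone and never by storage system, so adding a new location to a zone changes what the catalog returns without changing how anyone searches it.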