Read the webinar transcription:
[Brett Carpenter] Hello everyone, and thank you for joining today’s webinar on evolving your passive data catalog into an active data hub. My name is Brett Carpenter, the marketing strategist at Zaloni, and I’ll be your MC for this webcast. Joining us today is Scott Gidley, vice president of product management. Going over our agenda, we’ll be diving into the problem with today’s data catalogs, how an active data hub can help organizations achieve rapid business insights through self service data, where the Zaloni Data Platform comes into play, and some real-world use cases that showcase the power of an active data hub. Now, let me turn it over to Scott to give a brief intro to Zaloni and discuss how companies are moving away from traditional data catalogs to embrace active data hubs. Scott?
[Scott] Hey, Brett. Thanks for the introduction. I’m glad to be here today, and thank you to everyone for taking time out of your busy day to join our webinar. It’s very much appreciated. Before we get started with today’s content, I did want to take a moment to introduce Zaloni to everyone and to drive home why the concept of evolving from a passive data catalog to an active data hub is something we care so deeply about.
(00:55: Turning Data into actionable business)
So for those who don’t know, Zaloni’s core mission is to deliver a data management platform that enables enterprises to move fast and deliver high-value self service data to enable key business initiatives. We do this by delivering the Zaloni Data Platform, which is a unified and scalable software solution that allows you to do several things. First, you can collect or connect to any data source, whether this be on-premise relational databases, cloud-based applications or data that’s being stored in the cloud. We drive inline data quality, governance and enrichment capabilities, and this really negates the need for any additional point solutions that may raise the cost of your overall data platform. We can scale to any level of complexity, so if you want to start at a department and then move to the enterprise, both from a processing perspective and in the ability to catalog and use information, we can scale to any level. And then, probably most importantly for today’s session, we enable lines of business to enrich and take action on their data. Ultimately we want business users to have self service data capabilities to further enrich and then use their data in a more meaningful way. So during today’s webinar, we’re going to take a close look at the exploding data catalog market and at some of the challenges that many of the early adopters face. We’ll also show how the evolution from a data catalog to an active data hub will allow organizations not only to source and understand their catalog data, but to use it to deliver immediate business value without having to sacrifice the trust and security of the self service data. And with that, let’s dig in.
(02:20: Data Catalogs are going mainstream within Enterprises)
So in order to understand the value and ultimately the limitations of the data catalog, I think it’d be good if we first define the technology and then look at how the market is growing, or exploding, and several of the factors that are driving its adoption. For those who aren’t familiar, a data catalog can be defined as an inventory of data assets through the discovery, description and organization of data. The catalog provides context to enable data analysts, data scientists, data stewards and other data consumers to find and understand relevant data sets for the purpose of extracting business value. So that’s a mouthful, and something that I was able to grab off of Wikipedia. And there’s continued growth in this market. MarketsandMarkets, one of the analyst firms, is showing growth up to 620 million or 1 billion dollars over the next five to six years, and other industry analyst firms we’ve talked to have confirmed increasing inquiries on the adoption of this technology, fifty percent year over year over the past several years. So, you know, what are the market drivers that are really pushing this data catalog market into the mainstream?
There are several use cases that we can look at, and there are many examples we could go through, but at a high level there are just a few here to delve into what kinds of business problems data catalogs help solve.
First and foremost, I think any new data initiative or business initiative you’re trying to drive through a data-driven company or culture is a good starting point for a catalog. According to Gartner, by 2020 organizations that provide access to a curated catalog of both internal and external data sets will derive twice as much business value from analytic investments as those that do not. For new business initiatives, a catalog allows users to determine what data is available to them. If I can figure out where it was created and get some level of context as to why it’s useful to me, that really can jump-start focusing in on the data assets that I need to do my job, and data catalogs provide this immediate value. If they can look across an organization or across a line of business and help different types of users better understand the relevancy and usability of self service data, and sometimes the relationships of this data and how it relates to other information in your organization, that’s extremely powerful.
Regulatory compliance is another area, since it’s built on the foundation of being able to document, audit and trace information assets: where that information has gone and how it’s being used. In the past we’ve worked with several customers where this was left up to spreadsheets, wikis or SharePoint sites, or it could be a combination of metadata management and data governance tools. We worked with one customer that was tracking PII data in spreadsheets. Every time they identified a new data set within a database that had some sort of personally identifiable information, they were putting it into a spreadsheet, and as that spreadsheet grew they could go and say, hey, here’s where all of our secure information is. Obviously that’s not good as a long-term approach.
Catalogs can play a key role as the glue or the hub of this information, so that as more information is captured and gathered, people can log in, they can search, they can say, hey, where’s all my PII data? And they get the information back not only in a list, but they can also see different relationships to other tables that might also contain this private or secure information.
And then lastly, I think the primary use case to get things started is usually some sort of advanced analytics or analytic process. You know, the running joke is that data scientists or data analysts say they spend a large portion, anywhere between 20% and 80%, of their time trying to identify, collect and prepare relevant data sets, and the joke is that they spend the rest of their time complaining about how long it took to identify, collect and prepare these relevant data sets. So just like the new business initiatives or even the compliance initiatives, data catalogs can make these workers more efficient.
(06:45: The problem with today’s data catalogs)
So now you might be thinking, Scott, the business value that you just outlined that is provided by data catalogs seems pretty enticing. So what is the problem? Why can’t I just move forward with catalogs as they are? The issue here isn’t really to do with catalogs at all. Rather, the issue is that they play just one part in a much larger, fractured ecosystem of self service data tools, different integration technologies, and so forth. And ultimately, determining how to stitch these technologies together into a single environment for delivering a unified data supply chain creates lots of integration and governance challenges that can limit the productivity and return on the data catalog.
And so before we take a deeper look at how the sprawling ecosystem can lead to issues, I think it’s important to understand that what we see our customers asking for is the ability to source, understand and use their data, and to do this in a time-sensitive manner across different types of users in an organization, whether they be IT and data engineers or pure data analysts or business analysts. As we look at this ecosystem, you can see that there are distinct feature sets that overlap and exist across each of these capabilities. So let’s just say that data ingestion, data prep, analytics, data visualization and cataloging comprise this data supply chain ecosystem. Your data prep technologies may contain some data discovery and cataloging features, and data visualization tools increasingly allow for more self service data ingestion and analytics, and so on, so you’re not really sure: hey, if I’m going to use a data prep tool, do I need a data catalog? If I’m going to use a data visualization tool, do I need a data prep tool? And if you have multiples, how is that information shared across those different applications?
And that’s where the governance challenge starts, right? In many of these tools data governance is baked in and it’s very thorough. So a data prep tool may have a very strict policy on who can access the data sets, what can be done and where that data can be written. But ultimately, if that data can then be consumed by an analytics tool to create some new asset, where is the lineage? Where’s the tracking or the auditing of what’s being used and how it’s being consumed? That becomes much more difficult across all these different tools, and none of the governance process here is managed in a holistic way. Another thing you have to think about is the operational capabilities. If you’re using different tools and technologies for each part of this supply chain, you want to be able to manage the ongoing execution and scale. Some of these may require their own engine to scale, some may depend on open source technologies like Hadoop or Spark, some may run in the cloud or on-premise, and others may do both, so you’re starting to see the logistical challenges of how you’re managing each part of this process.
So let’s take a look at some real-world examples of how this can slow time to insight and reduce the overall productivity and visibility across the supply chain.
So first and foremost, I think that self service data discovery and data cataloging capabilities provide an immediate, great starting point for all the business reasons we talked about before, where different data consumers and data producers can identify the data that’s relevant to them. The catalog contains really valuable information regarding data privacy, entitlements, the quality of a given data set, how it’s related to other data sets, and potentially other policies that need to be applied on top of the data as it’s being used. This information is critical for any self service data preparation activity, but if there’s no integration, or integration is limited to some sort of batch export from a catalog, the self service data prep users are forced to follow a manual governance process, whereby they look up the data in a catalog, they see some of the policies that they’re interested in, they learn more about the information, and then they have to go to a separate application to apply what they’ve learned. That’s a very disjointed and not terribly productive experience.
Another example could be the need for an analyst to add a new data set to, let’s say, the data visualization zone within a data lake. We see this all the time with our customers: they’re ingesting and moving data into a specific zone where it can be used for data visualization. Let’s say I’m one of the data visualization business analysts and I need a new data set. Wouldn’t it be great if I could go through the process of identifying data in the catalog and pulling it into the zone, following all the governance processes that are set up as part of my overall data lake? What usually happens today is that they have to contact another type of user, a data producer who might be on the IT side, and go through their processes, and it slows down the entire time to insight for using that information.
(11:16: Data Catalogs: Current & Future State)
When we talk to our customers and prospects about the current state or the business impact of their data catalog investment, we usually see that it’s limited. As mentioned, catalogs are usually great starting points for new business initiatives or compliance and regulatory projects, but ultimately that’s not enough. Just understanding where your data is and identifying those relationships is great, and it gets these projects off to a great start, but people want more, and our customers increasingly want to evolve from their passive or standalone data catalog environment into an active data hub. This active data hub allows them not only to source and understand their data but to use it in a meaningful way. This can include enabling self service data access for different lines of business and non-technical users, or the ability to prepare, enrich and ultimately action this data, enabling it for use in line of business applications or in analytics applications. More and more, what we’re seeing is that from the catalog our customers want to be able to shop and find data that’s relevant to them, and then maybe pull some of that data, or filter that data, into something like Redshift or Snowflake or some other cloud-based data warehousing environment that they’re using for reporting or some sort of analytic project. The key thing about this is that as new data sets are enriched or created, or they’re actioned into one of these new environments, they’re automatically cataloged into the data hub, so that there’s a continuous evolution and auditing of all your data assets. And ultimately, to me, the key to the active data hub is that it makes good on the promise to deliver business value fast, but it doesn’t sacrifice or poke holes in your overall corporate or data governance strategy.
So over the course of the remainder of this webinar, we’re going to drill into the active data hub and look at a few different areas. First, we’re going to look at a maturity model that shows how organizations can evolve and helps them identify where they are and what the next steps are toward an active data hub, so that they don’t have to do it all in one shot. We’ll look pretty deeply at the capabilities of the Zaloni Data Platform and how we’re using it to help our customers implement an active data hub strategy, and then we’ll look at some specific use cases from customers we’ve been working with over the past year.
(13:22: Maturity Curve to Active Data Hubs)
Increasingly, organizations are more than a little cautious about implementing enterprise-level data management applications. And this is really for good reason, right? The past track record of the initial enterprise data warehouses or MDM applications has left a trail of expensive applications that are hard to maintain and usually don’t deliver on the promised business impact or the promised business value. That’s what makes the active data hub such an appealing concept: you can start small, deliver immediate business value and then expand your use case. We often use this maturity curve to counsel our customers as to where they are currently and how we can help them move toward a more active data environment. Typically, what we see in most organizations is that they have limited visibility across the enterprise for all of their data assets. They might have a good line of sight for a specific project, or maybe detailed metadata at a technical level, but it often lacks a business context or an operational view of this data. So usually organizations start by sourcing and understanding their data, and this typically includes building an enterprise-level data catalog. More frequently, though, we’re seeing our customers want to start at the department level, or perhaps on a specific store of data such as the data lake or a set of cloud-based applications. This really allows these organizations to work on additional metadata in the catalog. Maybe they want tag hierarchies or taxonomies that will help with search, and these things aren’t always fleshed out. So if you can start small and start working on them there, it won’t become such a burden as it would if you started at the enterprise level.
And once the catalog’s built, the next step is to allow end users of different skill levels or personas to augment the data or transform it with additional characteristics: maybe aggregate it, or derive columns, or join it with a third-party data source. I think the key here is really to say we’re going to start with a set of personas, and these personas can be different types of users, right? You may have a user that’s more technical and interested in building data pipelines. They want to take data sets and do complex joins and create a lot of complexity in what they’re building. This goes back to the days of the enterprise data warehouse and so forth, and that’s great, you want to be able to support that. But you also want to have a role for a persona who can log in as a business analyst and say, you know what, I just need these three data sets. I want to bring them into an environment, I want to join them together based on this criteria, and I want to filter out some of the columns on the output. They don’t need to go through the same use case, the same set of technologies, as the more IT-centric developer. That’s why I think having a tool that can support all the different personas from one environment is really important as we move forward.
And finally, once the new data sets are available and potentially marked as consumable, data consumers such as data analysts or any type of line of business analyst can shop for these newly created assets and provision them into an environment of their choosing. This might be a personal workspace on their desktop, for instance. It could be a line of business application or a reporting and analytics environment like Tableau. The key part of this maturity curve is that it’s cyclical. Anytime new data is created in level 2 or level 3, or if it’s actioned into some other environment, it’s updated in the catalog so that it’s now part of this larger data context. That’s really important, because you want it to be something that’s never-ending, right? Anytime a new data set is created, you want to feed the business context and operational context of that data right back into the sourcing and understanding phase, so that others who log on can find that information and use it if it helps them. This prevents data duplication from occurring and really gives you a much better view of your overall data assets.
So really it’s this maturity curve that allows organizations to adopt an active data hub one thing at a time, across one project or line of business, which can truly allow them to provide faster time to value with bigger business impact.
(17:06: So Passive Data Catalog to Active Data Hub for each LOB)
So now let’s take a look at how the maturity curve could be used to help deploy an active data hub via a phased approach.
Step one, as we defined before, is to build the catalog. Here it mentions that it’s an enterprise catalog, but this can easily be a departmental or source-specific catalog, and in practice we’ve seen customers start with their data lake and then add additional enterprise data warehouses or other types of data sources for specific line of business uses. It’s really up to you, where you are in your maturity and how big you want to start. But if you start with something like, I’m just going to catalog my data lake, or I’m just going to focus on the needs of a specific line of business, it can set you up for early success that you can build upon in later phases.
In step two, what we’re trying to accomplish is that the different lines of business may be provided access to the full catalog, or perhaps there are workspaces or projects created within the catalog so that users from a specific department or line of business can get access to, or focus only on, the data sets that are important or relevant to them. So in this example, let’s say that it’s a financial services company and they want to focus on consumer banking first. They would begin by cataloging those data sets. They can prove it out and add the required business and operational metadata that they need, and then add additional lines of business like retail or investment banking once they’re happy with the initial results.
Step three is the final step, where business analysts who are near and dear to that specific line of business want to find the data in the catalog that is required for their job. For consumer banking, perhaps it’s identifying data sets required for a new BI report. Once they search the catalog and find this information, they can determine if it’s fit for use, and ultimately they want to provision it into the departmental SQL Server instance so that they can plug it directly into their MicroStrategy instance and create this new report. Provisioning the data directly from the catalog not only allows them to verify the business usefulness of the data, it also creates lineage and a new catalog entry for the new SQL Server data set. This is really important, because in the future another user can use that data set rather than create a duplicate entry, or if they see it and say, hey, I need to refresh this, they can rerun the provision to refresh the data. So now, just like that, the data catalog has been turbocharged into an active data hub. And as the finserv company becomes more comfortable with this process, they can rinse and repeat for other business units like, as I mentioned before, retail or investment banking.
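To make the provisioning step concrete, here is a minimal Python sketch of the idea that provisioning from the catalog records lineage, adds a new catalog entry, and refreshes rather than duplicates on a rerun. It is an illustration only, not the Zaloni Data Platform API: the `Catalog` and `CatalogEntry` classes, the `provision` method and the location strings are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class CatalogEntry:
    name: str
    location: str                                      # hypothetical location string
    derived_from: list = field(default_factory=list)   # lineage: upstream entry names
    refreshed_at: Optional[datetime] = None


class Catalog:
    """Hypothetical in-memory catalog, for illustration only."""

    def __init__(self):
        self.entries = {}

    def register(self, entry):
        self.entries[entry.name] = entry
        return entry

    def provision(self, source_name, target_name, target_location):
        """Provision a cataloged data set to a target and record the result."""
        if source_name not in self.entries:
            raise KeyError(f"{source_name} is not in the catalog")
        # The actual data movement would happen here; it is omitted in this sketch.
        now = datetime.now(timezone.utc)
        existing = self.entries.get(target_name)
        if existing is not None:
            # Rerunning the provision refreshes the same entry instead of duplicating it.
            existing.refreshed_at = now
            return existing
        return self.register(CatalogEntry(
            name=target_name,
            location=target_location,
            derived_from=[source_name],
            refreshed_at=now,
        ))


catalog = Catalog()
catalog.register(CatalogEntry("accounts_raw", "datalake://consumer_banking/accounts"))
report = catalog.provision("accounts_raw", "accounts_report", "sqlserver://dept/accounts_report")
again = catalog.provision("accounts_raw", "accounts_report", "sqlserver://dept/accounts_report")
```

In this sketch the second `provision` call returns the same entry with an updated refresh time, which mirrors the point that another user can reuse or refresh the SQL Server data set rather than create a duplicate.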
(19:26: Active Data Hubs)
In the last slide we defined the concept of an active data hub, and in this slide we’ll show how the Zaloni Data Platform delivers on that concept. If we start on the left side of the diagram, you’ll see the various types of data sources that can be connected or collected via the ZDP. Data producer personas are generally given the ability to add new data sources to the catalog. These data producers are often data stewards or data engineers and sometimes have more of an IT slant. As part of this process the metadata is captured, and data is profiled and classified to help automate the process of identifying sensitive or potentially not useful data. It’s also here where additional business and technical metadata might be added to further enhance the catalog. In practice, we’ve seen data stewards or governance teams prevent data from being searchable in the catalog for business users until it has been deemed fit for use. So they might have a sort of bronze, silver and gold data, where only silver data can be searched by business users, whereas bronze data can be searched by all of the IT team as they work on improving this information.
If we keep moving to the right, data might be enriched or improved via data quality and data transformations as part of the data management phase. And once again, only the most trusted data sets might be available for delivery and provisioning. So again, you can say, I’m going to move data through bronze, silver and gold, and maybe certain lines of business can only see data that’s in the gold zone, or that has been marked as fit for use by badging it as gold. And then finally, data consumers or business applications can quickly search, find and use the most trusted data in the catalog. So, using the Zaloni Data Platform to create this active data hub, the data producers feel pretty comfortable because they’re allowing the self service data capabilities but they’re not giving up the controls and the governance needed to ensure trusted self service data, whereas the data consumers feel like they’re actively improving the overall data supply chain, because any new content they create gets fed right back into the catalog.
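The bronze, silver and gold idea can be sketched as a simple search filter. The following Python is a hedged illustration, not how the ZDP implements it: the persona names, the `MIN_TIER` policy and the `search` function are assumptions made up for this example, and a real policy could just as easily grant each persona an explicit set of tiers.

```python
from enum import IntEnum


class Tier(IntEnum):
    BRONZE = 1
    SILVER = 2
    GOLD = 3


# Hypothetical policy: the lowest tier each persona is allowed to search.
MIN_TIER = {
    "it": Tier.BRONZE,                # IT sees everything while improving it
    "business": Tier.SILVER,          # business users see data deemed fit for use
    "line_of_business": Tier.GOLD,    # some lines of business see only gold
}


def search(datasets, persona, term):
    """Return names matching `term` that the persona is allowed to see."""
    floor = MIN_TIER[persona]
    return sorted(
        name for name, tier in datasets.items()
        if tier >= floor and term in name
    )


datasets = {
    "accounts_raw": Tier.BRONZE,
    "accounts_clean": Tier.SILVER,
    "accounts_certified": Tier.GOLD,
}

it_view = search(datasets, "it", "accounts")
business_view = search(datasets, "business", "accounts")
lob_view = search(datasets, "line_of_business", "accounts")
```

The same search term returns three, two and one data set for the three personas, so producers keep control over what is visible while consumers still get self service search.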
(21:36: ZDP (now Arena) delivering Self Service Data Discovery, Catalog & Ingest)
A key capability of any active data hub is the ability to discover, catalog and ingest data from sources across and even beyond your enterprise. This is really critical, because it’s this unified catalog of business, technical and operational data that forms the foundation on which all of the other active data hub capabilities will be built. So from the ZDP perspective, what we’ve really tried to focus on are the several requirements that we think make our catalog complete.
First and foremost is the ability to catalog everything. Organizations want to catalog relational data sources, data lakes, and file systems on cloud storage such as S3 or Microsoft Azure buckets, and they want to get an all-encompassing view of this data. Some of it also may be from third-party applications.
So there are a few things you want to look at here. First and foremost, I want to be able to automatically inventory this data. So if I can point to a Hive Metastore, a relational database or an S3 bucket, then as changes are made the catalog is automatically consuming them and being updated with this information. Increasingly, we’ve been asked to catalog other catalogs, because it may be that an organization is using Azure Data Catalog or Glue in the cloud, but they have a much broader set of data sources and they’re trying to pull that information into this overall active data hub. Another feature that’s really important is the ability to leverage your existing enterprise definitions and policies that might have been defined in data governance or metadata tools. This is important because you’ve spent a lot of money and time building out definitions, business glossary terms and models, and you want to be able to reference them from a single active data hub. So from our perspective, being able to pull in glossary terms or link those things in allows the producers and consumers of data to get the most comprehensive view of all of the information in a single view, in a single search.
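As a rough illustration of automatic inventory, the Python sketch below reconciles a catalog with whatever a source currently reports. It assumes the source listing has already been fetched as a plain dictionary of table name to column names; how that listing is obtained from a Hive Metastore, a relational database or an S3 bucket is outside the sketch, and `sync_catalog` is a hypothetical helper, not a ZDP function.

```python
def sync_catalog(catalog, source_tables):
    """Bring `catalog` in line with what the source currently reports.

    Both arguments map a table name to a tuple of column names.
    The catalog is updated in place and a summary of changes is returned.
    """
    added = sorted(source_tables.keys() - catalog.keys())
    removed = sorted(catalog.keys() - source_tables.keys())
    changed = sorted(
        name for name in source_tables.keys() & catalog.keys()
        if source_tables[name] != catalog[name]
    )
    for name in added + changed:
        catalog[name] = source_tables[name]
    for name in removed:
        del catalog[name]
    return {"added": added, "changed": changed, "removed": removed}


catalog = {"customers": ("id", "name"), "orders": ("id", "total")}
source = {"customers": ("id", "name", "email"), "payments": ("id", "amount")}
changes = sync_catalog(catalog, source)
```

Running a reconciliation like this on a schedule, or in response to change events, is one way the catalog could stay current without anyone maintaining it by hand.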
Next, the ability to annotate or customize the different types of attributes you might want to store for your business need is really important. So whether that’s via key-value pairs or custom attributes or all of the above, you want to be able to add additional metadata that may be specific to your business so that it can improve the search capabilities and give a more meaningful understanding of the data. And then lastly, a capability that isn’t always mentioned from a pure cataloging perspective is the ability to ingest data into a new environment directly from the catalog. This is important when maybe you’re a subject matter expert, you can see that a data set exists, you need some portion of that data set to do your job, and you want to ingest it into a different environment, maybe some sort of sandbox environment. Being able to do that directly from the catalog is really powerful, as long as it follows your data governance policies and your security policies, and obviously updates the lineage of where this data is coming from, and ultimately that new data source is added directly back to the catalog.
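A small Python sketch of searching on custom key-value attributes follows. The attribute names (`owner`, `contains_pii`) and the `find_by_attributes` helper are hypothetical, chosen only to show how business-specific metadata can sharpen a catalog search such as "where is all my PII data?".

```python
def find_by_attributes(catalog, **wanted):
    """Return the names of data sets whose attributes match every requested pair."""
    return sorted(
        name for name, attrs in catalog.items()
        if all(attrs.get(key) == value for key, value in wanted.items())
    )


# Hypothetical catalog: data set name -> custom key-value attributes.
catalog = {
    "accounts": {"owner": "consumer_banking", "contains_pii": "yes"},
    "branches": {"owner": "consumer_banking", "contains_pii": "no"},
    "trades": {"owner": "investment_banking", "contains_pii": "yes"},
}

pii_sets = find_by_attributes(catalog, contains_pii="yes")
banking_pii = find_by_attributes(catalog, owner="consumer_banking", contains_pii="yes")
```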
Once you have a comprehensive catalog of available data, the next step in any active data hub is to allow subject matter experts like business analysts and data scientists to enrich and prepare this data for analytical and operational use.
So let’s take a little deeper look at a few of the data prep and enrichment capabilities that will be found in any active data hub. First and foremost, being able to prepare and enrich data directly from the catalog is a really valuable asset. We’ve already talked about how data scientists spend an abnormal amount of time trying to identify relevant and useful data sets, so being able to use the catalog to first search, filter and group data sets based on relevant business, technical or operational metadata is a tremendous productivity gain in and of itself. But ultimately this metadata can then be used to help them determine if there are usage restrictions on the data based on data quality scores, security or privacy concerns. Without this integration and the ability to launch into a self service data activity, the data scientist or the business user is left to manually search for data and then flip over to another application to perform the self service activities. Now, once you’ve found the data, these technologies need to be able to provide an easy-to-use but robust experience for transforming it in different ways. In most cases tools either have a drag-and-drop environment, which we have in ZDP, or they allow you to build recipes by tracking the user’s actions, and then deploy them on the full data set only once you’re happy with the sample outcome.
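The recipe idea, building a sequence of steps against a sample and then running the same steps on the full data set, can be sketched in a few lines of Python. This is a generic illustration under simple assumptions (rows as dictionaries, steps as plain functions); the step names and `run_recipe` are made up for the example and are not the ZDP recipe format.

```python
def drop_missing_email(rows):
    """Recipe step: keep only rows that have a non-empty email."""
    return [row for row in rows if row.get("email")]


def add_full_name(rows):
    """Recipe step: derive a full_name column from first and last."""
    return [{**row, "full_name": f"{row['first']} {row['last']}"} for row in rows]


def run_recipe(recipe, rows):
    """Apply each recorded step in order and return the transformed rows."""
    for step in recipe:
        rows = step(rows)
    return rows


full_dataset = [
    {"first": "Ada", "last": "Lovelace", "email": "ada@example.com"},
    {"first": "Alan", "last": "Turing", "email": ""},
    {"first": "Grace", "last": "Hopper", "email": "grace@example.com"},
]

recipe = [drop_missing_email, add_full_name]

# Preview on a small sample first, then run the same recipe on everything.
preview = run_recipe(recipe, full_dataset[:2])
result = run_recipe(recipe, full_dataset)
```

Because the recipe is just an ordered list of steps, the same object could be handed to a scheduler and rerun on a nightly basis once the sample outcome looks right.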
From a governance perspective, it’s really important that any work that’s done in self service data preparation can be deployed, scheduled and managed in the big data environment of your choice. First and foremost, you want to be able to schedule the recipe to run, because it’s going to happen over and over again, right? I want to bring the data in in an ad hoc fashion, but then on a nightly basis I want to refresh this information, and ultimately the resulting data set needs to be added right back into the catalog as part of the active data hub so it can be found and consumed by additional users. One of the things I think gets lost a lot in these self service data activities is the ability to use your existing big data infrastructure, or your existing data management platform that’s already being managed as part of your IT stack. What you don’t want to do is say, I’m going to use this type of technology and this type of infrastructure for my catalog, and this other type of infrastructure for self service data preparation and data compute. Ideally, you can leverage what you’re already using from a big data perspective, perhaps it’s Hadoop, perhaps it’s Spark, and being able to push this down into those environments gives you maximum flexibility.
As part of the Zaloni Data Platform, we enable this part of the active data hub: you can enrich data directly from workflows and transformations within our environment, directly into our catalog. Once those transformations are complete, they can be executed nightly, hourly or every minute, based on a scheduling capability that we have within our workflow environment. And as we mentioned before, any time new data is created, it’s immediately added to the catalog for additional consumption or use.
The last piece of the active data hub puzzle is to provide an environment where both business and IT users can create workspaces that allow them to further collaborate, customize and ultimately share data with other users or applications. In the Zaloni Data Platform, users can shop for data sets by adding them to a shopping cart and ultimately provision this data for use in several ways. One way is to provision the data out of the catalog into an existing enterprise data warehouse or data mart. A simple example here might be an ETL offload process for operational reporting and BI dashboards. You have a data set, it’s being updated, it can be found in the catalog, and on a nightly basis you want to update your enterprise data warehouse so that your BI dashboards can be refreshed daily. That’s a simple use case for provisioning.
Another use case might be publishing data in a hybrid environment to cloud data stores like S3 buckets, or potentially directly into Redshift or Snowflake, where it might be used for a specific project or line of business. We’re seeing this more and more, where people are managing data on premise, from the catalog or from their data lake, they’re building new assets, and then ultimately they want to automate the process of sharing those assets with cloud stores that are being used for different types of projects. The interesting thing there is that sometimes, once the work is completed, the assets in the cloud might go away, or they may only be available for a certain period of time, or they might be updated and used on a daily basis, just like the other examples.
Another example of how we can provision data from the ZDP is to enable self service data to be accessed from data visualization tools like Tableau, or maybe by connecting directly to data science notebooks like Zeppelin or Jupyter. In any event, whether you’re sharing the data with another application or provisioning from our data marketplace into an enterprise data warehouse, an active data hub needs to provide governance processes so that you can ensure that lineage is captured for the movement of data, that jobs are being deployed, managed, monitored and scheduled as needed, and ultimately that a notification is sent to end users once the data is available, so they can consume it, again, directly from the catalog.
With the more siloed or fractured solutions we talked about earlier, this can be difficult because there are lots of moving pieces and parts, but with the active data hub provided by the Zaloni Data Platform, it’s a much more holistic approach.
(29:32: Summary and Key Takeaways)
So before I turn things back over to Brett, I want to quickly review some of the key takeaways from today’s webinar. First, as the data catalog market continues to grow and expand, we are starting to see our customers want more capabilities. They want to evolve from the standalone or passive catalog environment into an active data hub that allows all types of users not only to source and understand their data but to use it in a more meaningful way to provide more immediate business value.
We also learned that following a maturity model can provide a road map that allows organizations to focus on line of business or Project Specific goals to deliver a faster return on investment and allow more agility to the overall process, right? You don’t have to start and expand across the entire ocean. Let’s focus on a specific line of business, let’s catalog that information, let’s make that actionable and let’s get some real business value from that and then we can add additional lines of business.
And then lastly, leveraging a unified solution like ours, the Zaloni Data Platform, really allows organizations to achieve their goals in a more holistic way, so you can deliver business value in days and weeks without having to sacrifice your overall governance and security requirements.
So as you think about modernizing your data platform or your data catalogues, please contact us and let us know how we can help you along the active data Hub maturity curve. Thank you so much for your time. I really appreciate it and back to you Brett.
[Brett Carpenter] Unfortunately, we won’t have any time for questions today, but feel free to submit any that might have come up during the presentation through our web page at zaloni.com/contact and we’ll be sure to get them answered. I want to thank all of you for watching this presentation and to Scott for taking the time to speak with us about how organizations can action relevant quality data into their applications for immediate value. Thanks again, and we’ll see you next time.