January 30th, 2020
Traditional data catalog solutions often require a conglomeration of separate tools (multiple catalogs, ETL, data governance, etc.) which are managed in silos by separate teams. When a data analyst needs to derive business value from this data, it requires communication across teams, integrations between products, and a high level of coordination to get them the data they need.
A single platform, on the other hand, provides a single source of truth for analysts to quickly gain access to the data they need in a self-service manner. From source to provisioning, the automated data catalog keeps gears aligned and the train on the track. This reduces the burden on the IT staff, while ensuring the right level of governance over the whole process.
An automated data catalog provides the workflow to take your data from source to value without manual intervention, allowing a small team to accomplish the same tasks as a much larger one. The catalog can automatically bring the data in from the systems of record, execute data quality rules, profile the data, prepare it for consumption, and provision it to the locations where the analysts can use it.
During this webinar, Matthew Monahan, Senior Product Manager at Zaloni, will explore:
– The benefits of a single application over connecting various point-solutions
– How automation from source to destination reduces your workload
– Real-world examples that you can leverage
Read Webinar Transcription:
[Brett Carpenter] Hello everyone, and thank you for joining today’s webinar, From Source to Value: Why Automation Is Driving Data Catalog Success. Now, this is the second in a three-part series where we dive into augmented data catalogs.
If you haven’t had a chance to view the first one yet, there’s a link to it available in the attachments tab located below the player window, but you might want to wait until after this one is over.
My name is Brett Carpenter and I’ll be your MC for this webcast. I’m excited to introduce our speaker for today’s presentation, Matthew Monahan, who is Zaloni’s senior product manager. Now, we’ll have time to answer your questions at the end of this presentation, so don’t hesitate to ask them at any point using the ask a question box, which again is located just below the player window. We’ll also have a poll during the webcast, which you can participate in using the vote tab when prompted. Now before we begin, though, I’d like to briefly introduce who we are and what we do.
(01:12: About Zaloni)
At Zaloni we enable enterprises to leverage their decentralized multi-cloud data environments to gain agility and cost savings while accelerating value from their analytics. The Zaloni data platform delivers trusted, high-value data through an augmented catalog, exceptional governance and security, and easy self-service access for all types of users. We work with the world’s leading companies, empowering teams using machine learning and an extensible foundation to conquer today’s data sprawl challenges. Now with that I’ll turn it over to Matthew to talk about the role data automation can play in your data catalog. Matthew.
[Matthew Monahan] Thank you, Brett. I’m really excited to be here for this second part of the three-part series. Today we’re going to talk a lot about data automation and how data automation in your data pipeline can save you a lot of extra effort and bring everything together.
(02:13: Agenda and Introduction)
So our agenda for today: we’re going to talk about the benefits of a single application over a collection of point solutions. We’re going to talk about leveraging data automation in your data pipeline for improved metadata quality, and we’ll talk about applying right-sized data governance for security and control.
So I would like to start with this quote, which I find quite interesting: “about 91% of organizations cite people and process challenges as the biggest barriers to becoming data-driven.” I think what that really tells us is that very often it’s not the tools that drag us down. It’s not technology, but it’s getting people to use technology. It’s putting the right technology in place so that users can easily accomplish their tasks, but to do so in a way that meets the overall business objectives. So from a data analyst or data scientist perspective, they want to get at their data as quickly as possible. But then there are other concerns within the organization around compliance and governance that also need to have their needs met, and so putting the right tools in place will make it easier to have the right people follow the right process and accomplish what you want to do with your data. But again, we’ll come back to this point throughout the conversation: people and process are the largest challenges and the biggest barriers to becoming successful with your data-driven pipelines.
(03:50: A Day in the life of your Data)
So what does a data pipeline look like? In its simplest form, the data pipeline is the path that data follows from the time it initially comes into the organization all the way through to when you’re making use of that data. Whether you’re using it internally or sharing it externally, there’s a path that data must follow, and a number of activities or actions happen to the data along the way. So if we take a look at this one example, files come in, and sometimes they’re files and sometimes it’s streams of data. Sometimes it’s Hadoop database sources. These files have to come into some kind of location, and with data lakes they’ve traditionally come into a single file system or a large Hadoop or HDFS filesystem. One of the things that we’ve seen a lot of change on in the industry over the past few years has been the idea of, rather than collecting all of the data into a single location, instead connecting to it where it exists. But the fact remains that there’s source data that gets generated or brought into your organization at some point, and you need to do something with that data. You need to connect to it or collect it, but then you need to capture metadata. You need to understand what that data is that you’re bringing in.
There are a number of workflows that will take place that will enhance, refine, improve, or manage that data. You may have data quality checks that ensure that the data meets your expectations as it comes in; that’s especially important for data that’s brought in from outside of the organization. You may profile the data to get a very basic understanding of what types of information are presented, or aggregate information about the data. You may perform a data governance task, such as tokenization or data masking, to ensure that the right people only see the right amount of information. You may start to do some data prep, such as joining multiple data sets together.
You want to ensure that you maintain role-based access to that data as well as to all this new metadata that you’re producing, make sure that the right people have access to it. Those who need it can see it when they need it and those who don’t can request access in an easy fashion.
Along the way you’ll also need to make sure that you have really strong lineage tracking. One of the keys to successful data governance is being able to draw that line, and we’ll actually take a look at all of these pieces today. We’ll take a look at the visualization of what that lineage actually represents, from source all the way through use of that data.
And then finally, the last step is actually being able to take a look within your data catalog, find the information you need, and then put it where you need to use it. Whether that’s within a relational database for data analytics or in a sandbox of some sort for data science purposes, that data needs to get where your users need to actually use it.
So this is just one example of a data pipeline, and as you can see, there are 11 steps along this path. There are quite a number of things that happen along this path, and one of the things that we’ve found in our experience is that there are just a ton of tools out there that help with each of these steps. One of the things that sets Zaloni apart from a lot of the other tools out there, and it’s a little bit of a philosophical approach to our view of our data automation platform, is that we like to cover the entire data pipeline. So from source to provisioning we offer all the capabilities that you need to manage that data pipeline, and what that ensures is two things.
One, you have a single source of information to provide for lineage, so we’re able to track that lineage all the way across the data pipeline. The second is the governance, and ensuring that governance is happening effectively all across that data pipeline. At the same time, we still follow that philosophy of being able to connect to the data where it exists, so you don’t necessarily have to put it all into one location and manage all the data access through one location. So using the existing systems that are in place for data storage, data management, and data queries, we can supplement that with a data pipeline that provides the governance and lineage you need to see it end to end. So there’s one of the questions we’d like to ask, and Brett is going to start a poll for us.
(11:10: Complexity and Chaos)
So let’s talk about all of those different systems. What is the challenge that multiple systems cause? Well, there are certainly three primary ones that come to mind most frequently, and all of this really leads to complexity and sometimes even chaos. The conglomeration of disparate tools can be a lot for one organization to manage successfully. So let’s talk about the first one, which is that separate tools lead to silos. What this means is that with separate teams each managing their own part of the data pipeline, you have someone responsible for bringing data in, someone else responsible for data quality, and someone else responsible for data governance. You have all these different silos, different teams, different tools, so that anyone trying to actually manage the entire pipeline really has a lot of work to do to go talk to each of those individual teams to understand the tools and possibly integrate with the tools. And so you end up layering multiple tools on top of each other just to solve the problem of too many tools. Ironic, I know. And so one of the things that we’d really like to focus on is simplifying the overall data pipeline.
The second is that lineage is really hard to ascertain. Lineage is the concept that every bit of data comes from some other bit of data. So when you’re looking at managing your data pipeline and you’ve got an end output such as a report, whether it’s a MicroStrategy report, a Tableau report, or even something in Excel, how do you know where that data came from? How do you know, and how can you assure, how that data has been processed along the way? In the event that you’re struggling to either interpret results in a report or to understand how something has changed with the report, being able to trace the lineage to go back to the source of the data and troubleshoot a problem or understand the source can be very difficult to do. And so one of the things that we do within the Zaloni data automation platform is to maintain that lineage as part of the process.
So from the time that the data comes in through all the workflows and processes that happen, tracing that lineage is a key value. And finally, the lack of single governance. This really goes back to a very similar point, which is that if you have distributed actions from distributed tools throughout the organization, it creates an additional layer of complexity when you try to manage the governance of that data. So if you have separate teams responsible for governance and determining the rules as to who can access what data, or what data has to change in order for someone to access it, the challenges propagate throughout that cycle. And so we’re going to take a look at some specifics and how we can go about managing governance in a single tool.
(14:20: Introduction to the Use Case – ESG)
So what I’d like to do now is actually dive into a use case.
I’ll introduce the use case first and then switch over to our actual data automation platform and show you how we could enable data automation for several of these tasks. So one of the things that’s become very popular lately, from an investment perspective, is understanding companies from an ESG perspective, which is environmental, social, and governance. If you look at any kind of stock application or stock ticker, there’s all sorts of numerical data that you can pull down about a particular stock offering. One of the newer sets that has become popular lately is around the environmental impact of the organization, their social practices, and their corporate governance practices. You can get this information from financial institutions, and a lot of portfolio managers today are using that type of information either to manage their portfolios in general or to put together specific funds that focus on one or more of those three areas. And so that data comes in and it needs to be made accessible to the people who are going to use it. It may come from several different third-party sources, and the reliability of that data can vary from day to day. It’s possible that the information changes. It’s possible that the structure of the files changes. It’s possible that information is missing when it’s transmitted. And so there’s a lot of workflow that goes into making sure that data is accurate and reliable before it’s used by portfolio managers to make investment decisions.
So we’ll take a look at that data automation and how the data comes in. We’ll take a look at how we bring that data into the data lake and catalog it, and we’ll take a look at the profiling and data quality checks that we can do on that data. And then finally, what we’ll do is provision it out to a relational database for analytics or reporting purposes. So with that, I’m going to switch over to the application itself.
(16:50: Use Case Demo)
So for anyone for whom this is their first view of our Zaloni data platform, we hope you enjoy what I have to show you today. I’m going to walk you through some very simple tasks and some example pieces of this data pipeline from one end all the way through to the other. So let’s start with the ingestion side. In this particular case, what we’re going to take a look at is actually bringing a file in from a third-party source and bringing it into a data lake. So if I start with the file ingestion, I’m going to show you a very easy way to do this in the application, and then we’ll talk a little bit about data automation. I’m going to go through and select a file; I’m going to grab one right off of my desktop computer. But what I want to point out is that we can also set up remote connections. So you can set up remote connections on the server side. We can set up FTP connections. There are a number of different ways to bring files into our data automation platform.
So I’m going to walk you through the desktop version because it’s the easiest to visualize, and then we’ll go back and take a look at some of the file pattern methodologies for picking up files that are dropped or retrieved from an FTP site. So I’m going to go ahead and choose a delimited file, a basic CSV file. So we’ll upload this file into our data automation platform.
And what we have already is that our system has processed the CSV, has taken a look at some of the basic information, and is providing a very easy way for me to bring this data into my data lake, so I don’t need to make a whole lot of other decisions here. I might want to expand on the business description or business name to make it a little bit more user-friendly. I could say, for example, ESG data from January 30th, so I can put whatever information in here makes the most sense. Continuing through the ingest process, we can see that the system has identified the columns from the CSV file. And it gives us the option at the end here if we also want to profile our data. So we have our source coming in, the target data that’s going to be populated, and some of the basics we can kick off right out of the gate, such as profiling. So I’m going to kick that off and it’s going to run in the background. It’ll get queued up and the system will process that data.
So in the meantime I’m going to switch over to an existing entity that we have. First, just real quick, I mentioned some file patterns. When I set up the ability to bring files in in an automated fashion, I have the ability to set up file patterns. So for example, with this one here I’ve said, okay, anything in the test data folder that starts with a certain string, and then I have a regular expression defined here, so there may be any number of digits following the original file name, and it’s a CSV file. So what I can do is define a file pattern so that it knows to pick up files on a regular basis. Whether that’s daily or hourly, whenever I get the data, I can have the system automatically pick up those files, bring them into my system, catalog them, and have them ready for use. So this is the first step in data automation: the ability to automate the process of bringing that data in. Again, this is an example of one where I’m pulling the files from a local file system.
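The file pattern described here (a folder, a fixed filename prefix, any number of digits, and a CSV extension) can be sketched with an ordinary regular expression. This is a minimal illustration only: the folder name `test_data`, the prefix `esg_metrics_`, and the function name are assumptions made for the example, and it does not represent how the platform itself defines or evaluates file patterns.

```python
import re

# Hypothetical pattern: files in a "test_data" folder whose names start with
# "esg_metrics_", followed by any number of digits, and end in ".csv".
FILE_PATTERN = re.compile(r"^test_data/esg_metrics_\d*\.csv$")


def matching_files(paths):
    """Return only the paths that the file pattern would pick up."""
    return [p for p in paths if FILE_PATTERN.match(p)]


candidates = [
    "test_data/esg_metrics_20200130.csv",  # matches: prefix + digits + .csv
    "test_data/esg_metrics_.csv",          # matches: "any number" includes zero digits
    "test_data/esg_metrics_20200130.txt",  # wrong extension
    "other/esg_metrics_20200130.csv",      # wrong folder
]
picked = matching_files(candidates)
```

A scheduler (daily, hourly, or on arrival) would then hand each matched file to the ingestion step; if at least one digit should be required, `\d+` would replace `\d*`.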
I can also set up a connection out to an FTP system. So as an example here, we have a remote server. It’s got an IP address and port, all the usual stuff you’d expect to have on an FTP connection. We support FTP as well as SFTP. So for this example server I can choose SFTP, so everything will be encrypted all the way through on port 22. So we have a number of ways to bring that data in.
So that’s bringing data in. Let’s take a look at an entity that we already have in the catalog. We’ll take a look at the raw data first. Here are some MSCI metrics, again focused on the ESG space. We can see that one of our users has already added some additional text, which makes the information easier to find, and what I’m going to do is take a look at some of the flow that happens here. So if we take a look at the ingest history, you can see that this data has been ingested on a somewhat regular basis. But one of the most important things about data automation is the ability to see when things go wrong. So if I’m looking here and I see that no data was ingested for several days, that might be a red flag for me to go investigate and see if there’s a problem with the feed of the data coming in, or maybe it’s a normal indication that those were non-working days. We don’t normally receive data on the weekends or public holidays. So it gives me some insight right away into what’s happening in the pipeline. So it’s not just about the data automation, but it’s about the ability to see into the data automation and know what’s happening in my pipeline.
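The check described here, flagging days with no ingestion while ignoring weekends and public holidays, can be sketched as follows. This is an illustrative sketch, not the platform's ingest history feature; the function name, the dates, and the holiday set are all assumptions chosen for the example.

```python
from datetime import date, timedelta


def missing_ingest_days(ingested, start, end, holidays=frozenset()):
    """Return working days in [start, end] on which nothing was ingested.

    Weekends and the supplied holidays are treated as expected gaps.
    """
    ingested = set(ingested)
    missing = []
    day = start
    while day <= end:
        is_working_day = day.weekday() < 5 and day not in holidays
        if is_working_day and day not in ingested:
            missing.append(day)
        day += timedelta(days=1)
    return missing


# Hypothetical ingest history for late January 2020.
history = [
    date(2020, 1, 21), date(2020, 1, 22), date(2020, 1, 23),
    date(2020, 1, 24), date(2020, 1, 27), date(2020, 1, 30),
]
gaps = missing_ingest_days(
    history,
    start=date(2020, 1, 20),
    end=date(2020, 1, 30),
    holidays={date(2020, 1, 20)},  # assumed public holiday
)
```

Here the weekend of January 25 and 26 and the assumed holiday are not reported, while the two working days with no data are, which is the distinction between a red flag and a normal gap.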
We talked quite a bit about lineage, so what I’d actually like to do is show you what lineage looks like in a very graphical way. So here’s our source data that we’ve brought in, and if we take a look we can see that a process happens, and then we have what looks like bad data, good data, and a report. I’m going to expand on the good data, and this is going to get a little bit dense here, but I want to show some of the new capabilities. Having brought data into the data automation platform, in a very graphical way I can now identify what happens with the data. So I can see there’s some workflow that happens. Then the good data gets processed more, then it looks like it gets joined with additional data from other sources, and eventually it makes its way out into some kind of destination use case, whether that is a database for analytics or sandboxes, as I mentioned earlier, for data science. So it’s a very visual representation of lineage in which we can see the entire course of this data. And this really does work in both directions, in the sense that I could have started with that output location and traced back to identify the multiple sources that provided information to that destination at the end.
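Tracing lineage in both directions amounts to walking a directed graph forward from a source or backward from a destination. The sketch below shows one simple way to model that; the class, the method names, and the entity names are invented for illustration and are not the platform's lineage model or API.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Directed graph of 'source feeds target' relationships."""

    def __init__(self):
        self._down = defaultdict(set)  # entity -> entities derived from it
        self._up = defaultdict(set)    # entity -> entities it was derived from

    def add_edge(self, source, target):
        self._down[source].add(target)
        self._up[target].add(source)

    def _walk(self, start, edges):
        # Breadth-first traversal that collects every reachable entity.
        seen, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    def downstream(self, entity):
        """Everything that was ultimately produced from this entity."""
        return self._walk(entity, self._down)

    def upstream(self, entity):
        """Every entity that contributed to this one."""
        return self._walk(entity, self._up)


# Hypothetical entities loosely following the flow shown in the demo.
graph = LineageGraph()
graph.add_edge("source_data", "good_data")
graph.add_edge("source_data", "bad_data")
graph.add_edge("source_data", "report")
graph.add_edge("good_data", "joined_data")
graph.add_edge("other_source", "joined_data")
graph.add_edge("joined_data", "analytics_db")

sources_of_output = graph.upstream("analytics_db")
outputs_of_source = graph.downstream("source_data")
```

Keeping both forward and backward edge maps is a design choice that makes the "start from the output and trace back" question as cheap to answer as the forward one.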
So now I’ll take a look at data quality. Data quality is key to having good, reliable data. And so you can see we have two types of data quality checks being applied on this data. One is at the entity level, so that looks at the entire data set coming in itself, and we could do things like a record count check. So going back to the original source data, let’s say that I normally expect somewhere between 500 and 600 records coming in each day. I could set a record count check and tell it, if I get fewer than 500 records or more than 600 records, to send a notification. What that’ll do is ensure that the data I’m bringing in on a regular basis is the data that I expect. And if it’s 480 on one day, maybe that’s okay, but at least I get a notification and I can proactively look into the pipeline rather than waiting until I get a service call from one of my users saying, I’m trying to see this information and it’s missing several companies, and I really need this information to do my job. I’d rather be proactive in finding out that there’s a problem as soon as we receive that data. The other type of data quality check that we can do is at the field level. So within the file, within the columns of the file that I brought in, taking a look at the type of data received is another way to validate that the data coming in is what I expect. So whether it’s a ticker symbol or a numeric value of some sort, there are a number of different rules that I can apply that will allow me to validate that the data is coming in in the format that I expect.
Again, there are any number of things that can go wrong, whether it’s the source data itself or some kind of processing problem along the way with the file where data gets corrupted. By putting in a check on the data as I bring it in and make it available for my users’ consumption, I can make sure that nothing has gone wrong along the way and have a greater sense of confidence in the data presented to my users.
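The two kinds of checks described above, an entity-level record count check and field-level format checks, can be sketched in a few lines. The 500 to 600 range comes from the example in the talk; the field names `ticker` and `esg_score`, the ticker format, and the function names are assumptions for illustration and do not reflect the platform's rule definitions.

```python
import re


def record_count_check(records, minimum=500, maximum=600):
    """Entity-level check: return a notification message, or None if OK."""
    count = len(records)
    if count < minimum or count > maximum:
        return f"record count {count} outside expected range {minimum}-{maximum}"
    return None


# Assumed ticker format for the example: one to five upper-case letters.
TICKER = re.compile(r"^[A-Z]{1,5}$")


def field_checks(record):
    """Field-level check: return the names of fields that fail validation."""
    problems = []
    if not TICKER.match(str(record.get("ticker", ""))):
        problems.append("ticker")
    try:
        float(record.get("esg_score", ""))
    except (TypeError, ValueError):
        problems.append("esg_score")
    return problems


# A day with 480 records triggers a notification; 550 does not.
short_day = [{"ticker": "ABC", "esg_score": "7.1"}] * 480
normal_day = [{"ticker": "ABC", "esg_score": "7.1"}] * 550
notification = record_count_check(short_day)
bad_fields = field_checks({"ticker": "abc1", "esg_score": "n/a"})
```

Returning a message rather than raising an error mirrors the point made in the talk: a count of 480 may be acceptable, but someone should be told about it as soon as the data arrives.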
Okay, so let’s go back now to the details and take a look at the field level information. We can see here there’s also additional information provided for each of the fields, so we can really start to get a sense for what our data represents, taking a look at the format and size. Again, we’ve tried to bring in as much metadata as possible to make it easy for the users to understand: What is this data that I’m taking a look at? What does it mean to me? And how can I use it?
So once the data user has determined that the data they need throughout this pipeline process is in fact the data they want to use, now they need to do something with it. So what I’m going to do is add this data to my cart. Let’s say that I’m doing analytics. I may want all the rows and all the columns to be made available for me to use in my report. But maybe this is a particularly large data set and I don’t need all of it. I’m just doing some data science exploration. I might choose a sample set, so I could choose either the number of records that I want or a percentage of records that I want. I can specify a filter, so I can use some actual query language, choose a field, and set some limits on that. In addition to controlling the output at the row level, I can also choose at the column level. We have quite a few columns here in this data source. I may not need all of these, or it may be that I’m setting up a data pipeline for a specific type of user who isn’t authorized to see all of this information, so I can pull in just those columns that are needed.
So I’m just bringing in a few columns, and we’ll save that to our cart. So now I’ve set up the data that I want to make available to a particular user, and I can go to my cart and actually provision it.
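The three controls described here (a row filter, a sample by count or percentage, and a column selection) compose naturally in that order. The sketch below illustrates the idea on plain Python dictionaries; the function, its parameters, and the column names are assumptions for the example and are not the platform's provisioning interface.

```python
import random


def select_for_provisioning(rows, columns=None, row_filter=None,
                            sample_size=None, sample_fraction=None, seed=0):
    """Apply a row filter, then an optional sample, then a column selection."""
    selected = [r for r in rows if row_filter is None or row_filter(r)]

    # A fraction is converted to a record count based on the filtered rows.
    if sample_fraction is not None:
        sample_size = int(len(selected) * sample_fraction)
    if sample_size is not None and sample_size < len(selected):
        # Seeded so that the same request yields the same sample.
        selected = random.Random(seed).sample(selected, sample_size)

    # Column-level control: keep only the requested (authorized) columns.
    if columns is not None:
        selected = [{c: r[c] for c in columns} for r in selected]
    return selected


# Hypothetical data set with one column the target user should not see.
rows = [
    {"ticker": f"T{i}", "score": i, "internal_note": "restricted"}
    for i in range(100)
]
subset = select_for_provisioning(
    rows,
    columns=["ticker", "score"],
    row_filter=lambda r: r["score"] >= 50,
    sample_fraction=0.1,
)
```

Filtering before sampling means the percentage applies to the rows the user actually cares about, which is one reasonable interpretation of the workflow shown in the demo.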
So let’s go ahead and provision this data.
And there are a number of different ways that we can provision data. We can provision data out to a new entity within our existing data lake, or we can also send it out to a remote server or to a relational database. To give you an example of some of these: if we’re going to continue to do additional processing on this data set, we may want to leave it within the Hive environment. If we are going out to something like Tableau or MicroStrategy or Excel for reporting, that relational database may make the most sense. Another common use case that we have with our customers is the ability to push the data back out to a remote server. One use case, and we aren’t going to have time to go into depth on this one during today’s webinar, is an insurance company that gets multiple files of input from their providers with claims information. They bring that into the data lake, they process it, run quality checks, do a number of other workflow actions, and run that against their business rules, which allow them to determine whether or not claims should be paid and how much should be paid out. Then they have to return that information on the payments back to those same providers. And so with this pipeline they’re able to automate that entire process, where they bring the data in, run it through all the workflows, and then push the data back out to where those providers can pick up the resulting payment information to understand whether they need to bill their patients for the remainder or if everything has been covered. Today, we’ll take a look at the relational database example, and I’m going to push this out to a MySQL server. You can see we have support for a number of different relational databases; we’ll push it out to MySQL. I’ve already established this connection, so all the information is pulled in and I can test the connection to make sure everything’s working. There we go, connection was successful.
All right, so we will now review all of the information here. We have our details and what we’re provisioning. We can see where the destination is, and we will submit that provisioning workflow. And so that provisioning is now going to process, pull the data that I requested, and send it out to my MySQL database. And that’s the database that I can now give the users access to so that they can perform any of their required analytics or reporting. So, taking it through, I know it’s been quite a few areas. What I wanted to do is demonstrate from end to end the entire pipeline, all within one tool. There are a lot of areas that we didn’t have time to go into more depth on, but in addition to some basic workflows, we also have transformations. So to take a look at some areas here: we have the ability to do aggregations, and we can do filtering and joins. So there are a number of transformation actions that we can actually do on the data, and this starts to get us into some of that lightweight data prep space. We also have, if we take a look at our workflows, some really incredible capabilities that we can bring in.
We can bring in some data profiling, and we even have the ability to do data mastering. I can run code through a custom Java file. I can bring things in through MapReduce or PySpark. I can write Python code. So there are a lot of different ways that I can actually execute workflows right from within this data automation platform. And for a lot of our users, that meets their needs for doing the basic processing that they need to do on the data. And if for any reason it doesn’t, one of the other things we can do is actually make REST calls. We have a number of customers that use this to trigger additional actions that happen in other systems. So going back to the healthcare scenario of processing insurance claims, one of the things that you can do is process the core information, the data that you get, then make a REST call to your business rules engine that focuses specifically on claims processing, get the results back, and continue from there. So the platform has a very high level of extensibility in the capabilities around being able to execute specific workflows and code.
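The pattern described here, processing the core data, calling an external business rules engine over REST, and continuing with the result, can be sketched with the Python standard library. Everything specific in this sketch is an assumption: the URL, the payload fields, and the function names are invented for illustration and do not describe any real rules engine or the platform's workflow actions.

```python
import json
import urllib.request

# Hypothetical endpoint of an external claims rules engine.
RULES_ENGINE_URL = "https://rules.example.com/claims/evaluate"


def build_request(claim):
    """Build a JSON POST request carrying one processed claim."""
    body = json.dumps(claim).encode("utf-8")
    return urllib.request.Request(
        RULES_ENGINE_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and decode the JSON response (performs network I/O)."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def evaluate_claim(claim, send=send):
    """Workflow step: call the rules engine, then continue with its decision."""
    decision = send(build_request(claim))
    return {**claim, "decision": decision}


def fake_send(request):
    # Stand-in for the real rules engine so the step can run offline.
    claim = json.loads(request.data)
    return {"pay": True, "amount": claim["billed"]}


result = evaluate_claim({"claim_id": "C-1", "billed": 120.0}, send=fake_send)
```

Passing the sender in as a parameter keeps the workflow step testable without a live service; in a real deployment the default `send` would reach whatever system owns the business rules.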
So I think that’s everything that I was going to share in the demo today. So I think we have just a couple more slides to wrap things up. And I will remind you that at any point if you have any questions, there’s a questions window down below the viewer and I encourage you to ask any questions that you have about automating the data pipeline.
(36:10: Recap and Summary)
So just to recap, we talked about the benefits of a single application over a collection of point solutions, and demonstrated how to leverage data automation of the data pipeline to improve the overall quality of the data and metadata that you supply to your users while still applying right-sized data governance for security and control.
And this reinforces what we started off by saying: we’ve designed our data automation platform to be a comprehensive solution that does go end to end, and we call this our 3 Cs: catalog, control, and consume capabilities. I’ll wrap up here and turn it back over to Brett to see if we have any questions.
[Brett Carpenter] That’s great. Thanks. And yeah, like you said, if you have any questions, use the ask a question box. And if these spark any other questions that you may have, you know, feel free to drop them in that box. But the first one that we got is,
Can I set how long provisioned data is available?
[Matthew Monahan] That’s a good question. So when data becomes provisioned, it depends on how you provision. There are three ways that data can be provisioned today. You can provision it out to the existing Hive data lake, you can provision it to a relational database, or you can provision it out to a remote server. In terms of data controls on that, there are ways that you can, within the workflow, follow up that provisioning with a removal of that data. When we provision data, we’re actually provisioning the data in a way that makes it available within your existing systems. So for example, if you provision out to a MySQL database as we did in the demo, then the MySQL database is the system that manages and controls access to the data. So the users that have access within that data automation platform can access the data, but what we can do is automate through workflow the ability to remove that data when it’s no longer needed in that scenario. One of the things that we’re working on, just to get into a little bit of a roadmap item here: we’re working to have some tighter integration with data science notebooks and sandboxes. And as we do that, we’re looking at ways to have those sandboxes come online with a specific lifespan. So in that case the sandbox isn’t necessarily an existing database, because you’re not necessarily provisioning to an existing database where you do lots of reporting or other activities. What we’d rather do instead is spin up a new database that’s designed to be just part of that sandbox environment. And we have one customer today that is working with our system to do just that. They are in fact using our system to initiate the request, which goes through an approval process and then allows a user to get access to a sandbox, which does have a defined lifespan for that box itself. So there are certainly ways to do that, and I’m happy to chat more with anyone if they’d like us to talk more about how we’ve done that for one of our customers.
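The idea of following a provisioning step with an automated removal once the data is no longer needed can be sketched as a small bookkeeping table plus a cleanup job. This sketch uses the standard library's SQLite as a stand-in for the destination database so that it runs anywhere; the table names, column layout, and function names are assumptions for illustration and do not describe the platform's workflow or the demo's MySQL setup.

```python
import sqlite3
from datetime import datetime, timedelta


def provision(conn, table, rows, lifespan, now):
    """Create a provisioned table and record when it should expire."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS provisioned "
        "(name TEXT PRIMARY KEY, expires_at TEXT)"
    )
    # Table names cannot be bound as parameters; `table` is assumed trusted.
    conn.execute(f'CREATE TABLE "{table}" (ticker TEXT, score REAL)')
    conn.executemany(f'INSERT INTO "{table}" VALUES (?, ?)', rows)
    conn.execute(
        "INSERT INTO provisioned VALUES (?, ?)",
        (table, (now + lifespan).isoformat()),
    )


def remove_expired(conn, now):
    """Follow-up workflow step: drop every table whose lifespan has passed."""
    expired = [
        name
        for (name,) in conn.execute(
            "SELECT name FROM provisioned WHERE expires_at <= ?",
            (now.isoformat(),),
        )
    ]
    for name in expired:
        conn.execute(f'DROP TABLE "{name}"')
        conn.execute("DELETE FROM provisioned WHERE name = ?", (name,))
    return expired


conn = sqlite3.connect(":memory:")
provisioned_at = datetime(2020, 1, 30)
provision(conn, "esg_sample", [("ABC", 7.1), ("XYZ", 4.2)],
          lifespan=timedelta(days=7), now=provisioned_at)

removed_early = remove_expired(conn, now=datetime(2020, 2, 1))
removed_later = remove_expired(conn, now=datetime(2020, 2, 7))
```

The destination database still controls who can read the table while it exists; the cleanup job only decides how long it exists.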
[Brett Carpenter] And kind of along those same lines another question that Does the data provisioning update the lineage of a given entity?
[Matthew Monahan] Good Question and I will have to check to figure out which version specifically added that capability but yes, that is something that we’ve worked on so that when we provision to a database that that provisioning action so today it does create the additional entity and We have the ability to do is to catalog that entity as well so that it shows up as not only part of the catalog. But in that overall flow of the data pipeline, so from again from end to end the idea is to capture everything that happens with that data from the first ingestion of that data or connection to that data all the way through to final provisioning and use of that data.
[Brett Carpenter] Awesome. Let’s see, another one we have is: can I share data sets that have been refined with others in the catalog?
[Matthew Monahan] Yes, so there are a number of different ways to share data. You can share between projects. Let me just switch back for a moment here and take a look at one specific area. So if you take a look, I’ve got my project listed here, which is Responsible Investment, and there are a number of other projects also within this instance. So a project, think of it almost like a team: it brings together a number of data sets and a number of workflows to do the things that you need, along with the roles that are associated with the project and who can do what with the data within that project. So you have the ability to share items from within the catalog with other projects.
So from here, for example, if I look at the ones that I own within this project and choose one of these, I can share it with another project. It’s a very easy way to share. In fact, one of our customers is using the public project, which is a default project, as a really good way to manage sort of an overall catalog. No one really has access to do anything but view in the public project, and the way they have configured it, any user can come in and see that data is available, but to be able to go a layer deeper or to do anything with the data, they would have to bring that data set into one of their specific projects. So it’s a great way to provide that whole inventory of all the data that’s available while still maintaining control over who has access to the data. So to answer the question: yes, you can share data, and you can control what level of access people have to that data as well.
[Brett Carpenter] Awesome. Let’s see. I think this might be the last one. It says: if I’m using an existing process for my data pipeline, can I still leverage your catalog and consumption capabilities, or must I use your pipeline process too?
[Matthew Monahan] No, there’s no requirement to use our pipeline, depending on what you’re trying to do. We do have some customers that focus on using our data automation platform for the ingestion and catalog side of things. And actually, I’ll use a specific example: one of our customers is looking to enhance the way they use our platform today. They have a number of processes that happen outside of our platform, and they still want to be able to trace certain activities via lineage within our platform, even though the actions are happening outside. So we also offer a set of APIs that allow you to specify lineage records from an external source. They’re able to manage the data ingestion and cataloging within the Zaloni data automation platform, but then call the APIs that we provide in order to draw the lineage from those items in the catalog to things that are happening outside of our platform. So it’s definitely something that can be done, and we have customers doing that today.
[Brett Carpenter] All right, very cool. Okay, so I think that’s all the questions that we have. If you’re watching this and realize a question came up and you want to get an answer to it, just shoot us an email at info@zaloni.com. Same if you’re watching this on-demand and have a question: shoot us an email. And I want to thank all of you for joining this presentation, and thank Matthew for taking the time to speak with us today about how automating your data catalog can bring success to your projects. Again, this was part two of the series, so watch the first part (the link is in the attachments tab), and stay tuned for more information on the upcoming part three. This presentation will be available on-demand both on the Zaloni Resources Hub and on BrightTALK for future viewing, and the slides will also be available in the attachments tab, again located just below the player window. Thanks again, and we’ll see you next time.