March 27th, 2019
Today’s enterprises need a faster way to get to business insights. That means broader access to high-value analytics data to support a wide array of use cases. Moving data repositories to the cloud is a natural step. Companies need to create a modern, scalable infrastructure for that data. At the same time, controls must be in place to safeguard data privacy and comply with regulatory requirements.
In this webinar, Zaloni will share its experience and best practices for creating flexible, responsive, and cost-effective data lakes for advanced analytics that leverage Amazon Web Services (AWS). Zaloni’s reference solution architecture for a data lake on AWS is governed, scalable, and incorporates the self-service Zaloni Data Platform (ZDP).
Learn how to:
– Create a flexible and responsive data platform at minimal operational cost.
– Use a self-service data catalog to identify enterprise-wide actionable insights.
– Empower your users to immediately discover and provision the data they need.
[Ryan Peterson]: Hello everyone, thank you for attending today’s webinar, and welcome to “A Governed Self-Service Data Platform Accelerates Insights.” My name’s Ryan Peterson. I’m the global technology segment lead for data at Amazon Web Services. I’m going to be the host and moderator for today’s webinar. I’m joined today by Scott Gidley, vice president of product management at Zaloni. He’s going to go over a bit about data catalogs along with a solution architecture for an S3 data lake. And before we get started with that, I’m going to discuss how AWS thinks about governance.
(02:00: Learning Objectives)
We’re going to talk about creating a flexible, responsive data platform with minimal operational cost, about self-service data catalogs to identify enterprise-wide actionable insights, and about how to empower users to immediately discover and provision the data that they need. AWS customers consistently tell us there are five major reasons that they choose AWS. At a high level, the first one is agility: the ability to very quickly spin up the equipment that you need to use the various software that you want to use. Elasticity is being able to scale that in very automated and controlled ways, which lets you grow without having to administer a large-scale environment. Then there are the globalization opportunities: as you scale your business to multiple different countries, being able to copy and paste the same type of infrastructure into a different region and also maintain all of the different compliances in those regions. From a cost perspective, instead of having to buy equipment, figure out how much you need, and sometimes end up buying too much equipment and having to re-provision things, cost becomes something that’s very manageable in an AWS environment. And finally functionality: with thousands of new features every year, we look to create and expand upon an ever-growing list of capabilities.
But we’re here because we see a growing trend. The front page of the news seems to be always littered with these problems of PII breach incidents, and just in the last couple of minutes that I’ve been talking, we’ve seen over 10,000 records stolen by some bad actor. You can see here that in 2017 there were 1,700 consumer incidents, with 72% of those being an outside attacker. We spend a lot of time thinking about how you stop an outside attacker from getting this data, but the things that really matter are the internal attackers, the people that you trust to do the right thing with your data. How do you manage your data environment in such a way that those insiders have the access they need without getting too much access, or getting the wrong kind of access, or keeping it too long? And so we want to talk about that a little bit today. The staggering number at the end of this is billions of records stolen, which basically means half of the world’s population will have seen a data breach within the year. We really would like to see our customers work with us to try to reduce those counts.
What problems are customers trying to solve?
So we brought this question to the customers: what do you know about your data? Where is it? We spoke to various C-level executives, and it was really amazing how many of them had these kinds of questions. To throw a few of them out there: What type of information am I collecting? Where do I collect it from? Do I have legal statements, and am I managing the downstream? As my analysts are working on data, are they using it in a way that is respectful of the consumer collection statement under which it was collected? If they say they won’t use data for a certain purpose, are they following along with that, or are they doing the wrong thing? How do they know if rogue agents took the data and misused it, until it’s too late? And as we’ve all seen in many of these big hacks, it’s been the C-level executives that have felt the biggest strain. Think about it: in some of these really big ones, the biggest loss of financial value for a company could come because of a breach caused by one person within an IT organization, or any organization for that matter, who might do the wrong thing.
How are privacy regulators protecting consumers?
So CEOs are not the only ones that are worried about this problem. The regulators are out there also worried about this problem. There have been plenty of regulations over the years that have had various privacy requirements. PCI may be a good example of that, where it’s about keeping payment card transactions private to try to make it so that there’s less theft. But really the big hundred-pound gorilla on the block is GDPR, which came out of Europe, and then the equivalent CCPA in California. I haven’t even had the time to update this slide to include things like the Brazilian requirements, the Australian requirements, the Canadian requirements, and just dozens of others throughout various countries, and dozens more that are coming up with requirements that are going to be different for each country. We’re going to have to try to deal with all of these different privacy regimes.
What is the root cause of a breach?
So as my team and I started thinking about what the root cause of all of this is, we came to the conclusion that it’s PII. The moment that you collect PII or connect to it, you develop some sort of risk of misuse of that data. The moment that you store that data outside of collection, you have additional risk on top of that. And the moment you use that data, you’ve added additional risk to storage and collection. But where it really gets tricky is when you start proliferating data: if you don’t understand where the data was collected and where it has proliferated to, that can become a bit of a challenge. And so the need for a data catalog becomes really critical as we move PII throughout the organization for various purposes. Finally, there’s the deletion of PII, and knowing when it’s appropriate and also when it’s required legally.
Major considerations of the design
So we started designing out a solution. We came up with a few different design criteria, with all of the normal things you would expect: for example, having a data catalog, and all the common systems like ETL need to be in there. But do we suggest ETL versus ELT? Well, ELT says we extract, we load, then we transform. The challenge of that is that the data has now already landed in the other environment and the PII exists there, which means that the storage administrators, the file system administrators, any person with permissions to change those permissions can now reach it. We were worried about the proliferation more than the collection. We realize that in order to do business, you probably need to be able to ship somebody a product, so you’re going to have to have their address. In order to deal with customer service, you need the phone number or email address.
We realized that hashing is, we think, better than encryption. Encryption may be needed for particular purposes, but if you can get away with hashing, it’s a one-way process and can’t be turned back into the original consumer. Finally, we get into matching systems. We think decentralized matching is better than centralized matching, only because if you take a lot of different PII and you put it into a single place for matching, then you’ve created a honeypot for hackers.
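Illustrative sketch (not from the webinar): the preference for ETL over ELT and for hashing over encryption can be pictured as a transform step that replaces PII fields with one-way hashes before a record is loaded anywhere. The field names, the sample record, and the choice of a keyed hash (HMAC-SHA256) are assumptions made for this example; the speaker only says “hashing.”

```python
import hashlib
import hmac

# Hypothetical key for illustration only; a real deployment would fetch
# this from a secrets manager rather than hard-coding it.
SECRET_KEY = b"example-only-key"


def hash_pii(value: str, key: bytes = SECRET_KEY) -> str:
    """Return a keyed one-way hash (HMAC-SHA256) of a normalized PII value."""
    normalized = value.strip().lower()
    return hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def transform_record(record: dict, pii_fields: set) -> dict:
    """Replace PII fields with hashes BEFORE the record is loaded (ETL, not ELT)."""
    return {
        name: hash_pii(val) if name in pii_fields else val
        for name, val in record.items()
    }


# Hypothetical input record.
raw = {"email": "Jane.Doe@example.com ", "country": "US", "order_total": 42.5}
safe = transform_record(raw, pii_fields={"email"})
```

Because the same normalized input always produces the same hash under the same key, two systems could in principle match records on the hashed value without either one holding the raw email address, which is one way to read the decentralized matching point.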
So that led us to build an architecture. There are a lot of little icons on here, and I don’t want to scare people. It’s really just intended to say there are a lot of options for things that can be used to do various elements. Overall, these are different things that can be done if you should need them. An example of where you may or may not have a need: if you collect a lot of video and images and you have handwriting on those images where there might be PII, then you may want to use our Rekognition service to be able to grab all of that information off the image.
If you’re streaming log data through Kinesis, you can use Firehose to ETL that content and ultimately get the data in as well. So each thing has a different purpose. But what we’re going to do today is focus our energy on talking about how the data lake ultimately transitions information into the data catalog and why the data catalog is so critical. Before I hand it over to Scott to talk about what Zaloni does and how the data catalog works at Zaloni, I wanted to give you a metaphor from my kids. They have Legos all over the house, and when I ask them what they want to build with Legos, they have no idea where to start, because they don’t know where in the world the Legos are. So we have a day where we go, okay, we’re going to go collect all the Legos and put them back in the bin, and then when we do that, we can count to see how many of each we have: two wings to begin building an airplane, or whether we have wheels to build a car. And what really stuck for me is that it’s the same with data: if I don’t have knowledge of all the different elements, of all the pieces that I have to build an insight, how can I possibly get the right value or insight out of my data? So, just like with the Legos, we’ve got to figure out where all the wings are, where all the tires are, where all the information is, so that we get to the right, appropriate output. So with that, I’ll hand it over to Scott, and Scott can go through what Zaloni does to support this governance architecture.
(11:16: About ZALONI)
[Scott Gidley]: Hey Ryan, thanks for the introduction, and welcome everyone. Thank you for taking some time out of your day to learn a little bit about a self-service data platform that focuses on data cataloging as the foundation to make data producers and consumers a little more efficient in their work, but still drive a lot of the governance processes that organizations require, so that they can meet all the regulatory compliance initiatives but still process data at scale.
(Data modernization challenges)
So as we go through my presentation, I want to take a look at some of the data modernization challenges that our customers have faced, and how we initially approached that with a solution that helped drive data lake management and enterprise data management capabilities. But now we look to enable more of the connect-versus-collect architecture that Ryan mentioned, where you’re not proliferating data into a common area just to do it, but you’re maybe being a little more thoughtful about how and when and who moves the data into these different environments.
Traditionally, Zaloni has managed enterprise-level data lakes for organizations for many, many years now. We see these common data modernization challenges that our customers and different organizations are facing. They want to be a little more efficient with how quickly they can get data made available for insights, but they’re struggling with architectural complexity. In the big data landscape, whether you’re building out solutions on premises or in the cloud, there’s lots of different tooling that provides lots of different capabilities, and you need to figure out how you’re going to stitch these various tools together in a way that not only performs the capability that you need but also adheres to your internal data governance processes.
And this is driven by a lack of skill sets, right? People who are more familiar with enterprise data warehouses and SQL structure may be using tooling that’s a little less familiar to them. Maybe in the past they focused on learning MapReduce, and now maybe they’re working with Spark or some different types of technologies, so they’re having to raise their skill set at the same time that they’re trying to solve a business problem. And certainly managing data across cloud and hybrid environments can be challenging.
Although that’s becoming more and more commoditized, and it’s easier to do with all the tooling that not only AWS provides but that solutions like Zaloni offer as well. As you look at this architectural complexity, it drives then into how you are managing your data and not forgoing all of the data quality, data governance, and data security principles that maybe you had in place for years and years. And really the key to that, the hub of all that, is metadata management. How are we capturing the proper business, technical, and operational metadata, so that end users can easily answer the questions of why I collected this information, what information I am collecting, whether the data is being used appropriately, and whether it is being secured appropriately? Being able to then gauge whether there’s any risk or misuse of this information throughout the enterprise really has to start on the foundation of metadata management and ultimately a data catalog.
And then ultimately, as you build out this new data modernization platform, you want to have automation, and intelligence built into that automation, so that it’s simply rinse and repeat, or so that you can start making recommendations based on previous work or previous data sources that you cataloged, recommending different types of data governance or data lifecycle policies for similar types of data. And then finally, as we expand all of this out, the ability to govern this at scale across the enterprise, or across data sources that may be outside your enterprise, is important. You don’t want to create data swamps, where you have pockets of information that may not be collected or connected in a way that lets you understand why that information might be useful. You don’t want to create ponds where you’re duplicating the same type of data governance policies, where to change one you have to change it in multiple places. You want to try and do this in a more holistic way.
(15:20: Why Catalogs are key)
And I think, from Zaloni’s perspective, and I was talking to Ryan about this, we started off saying, okay, the data lake is the way to go here. We’re going to bring data in and pursue maybe a zone-based architecture where there’s a landing zone, maybe transient in nature, where all of this raw information comes in, and then we only move certain information into the raw zone or the trusted zone. That’s where perhaps some of the tokenization, encryption, or hashing is done to anonymize information as needed. Then there are maybe further refinements and more ETL types of processing that move the information into a refined or sandbox zone where your data scientists or your business analysts can get access to it. I think that works very well, and it has worked really well for many of our customers, but it also creates more of that proliferation of information. As data comes into your transient zone, it’s the raw information that has maybe not been tokenized or hashed or encrypted. It’s a copy of information that’s coming from a source system and then being moved somewhere else, and you have to worry about what that proliferation might do from a misuse or breach perspective.
So we started pivoting to more of a data catalog approach, where we want you to catalog all the information at its source, define policies around who should have access to that information and what information in that data source should be accessible, and further describe the use cases of that data. I think that really drives a lot of why data catalogs have become so popular. For new data initiatives, people want to find out what data is available to them. According to Gartner, by 2020 organizations that provide access to a curated catalog of internal and external assets will derive twice as much business value from analytics investments as those that do not.
So again, it’s making sure that people can quickly and easily see what data is out there and, more importantly, what data is relevant for them and how they should use it. And I think this is where being able to provide some level of context and meaning to the usefulness of data can jump-start these new business initiatives.
Regulatory compliance: I’m not going to spend too much time here, since Ryan mentioned this quite a bit. But the fact of the matter is that any regulatory compliance is built on a foundation of being able to document, audit, and trace information assets. That includes the lineage of where the data came from, how it was used and transformed over time, who performed these transformations, who has access to the data, and certainly any PII or other secure information that is being stored in these particular data sources. A data catalog is a great way to catalog data at its source and provide this map of information across and beyond your enterprise.
And then finally, it wouldn’t be a data management presentation if we didn’t talk about the use cases for data scientists, who spend a large portion of their time finding data. Depending on the quotes that you see, they spend anywhere between 20 and 80 percent of their time trying to identify, collect, and prepare relevant data sets for their use cases, and the running joke is that they spend the rest of the time complaining about how long it takes them to find relevant data sets to use in those use cases. So a data catalog that more quickly helps the data consumers, whether they’re data scientists, business analysts, data stewards, and so forth, find information that’s relevant to them and then share and collaborate on it with other people in their projects is really important. From Zaloni’s perspective, this has been driven out of our data lake management offerings. It’s critical as you add more and more data sets. We have customers who have tens of thousands, if not hundreds of thousands, of different types of assets they’re managing.
They want all the different lines of business to be able to log on, browse the catalog, and find data based on a rich set of facets and filters, so they can find information that’s going to help them do their job more effectively. We’ve got some example use cases coming up.
(19:44: Data Catalogs: Current and Future State)
So one of the things that I think is kind of interesting is that data catalogs aren’t a brand-new technology. They’ve been around for several years, and they’ve become more and more popular. I think there are two different types of data catalog capabilities. There’s what I’m calling the pure-play data catalog, which is primarily used for inventorying and identification of information. These often use machine learning capabilities to provide really complex, really valuable data classification of elements. So how do I identify all of my name or address data, or maybe more importantly Social Security number or credit card related information? They’re really helpful in these regulatory use cases where you want to be able to audit and find data of these different types, and they’re also often used in conjunction with existing enterprise metadata or data governance solutions to feed information into those.
And then there are also catalogs that are embedded into applications to make those applications more useful. Things like analytic applications or data management applications may have a catalog with a lot of the same features as the single-purpose or pure-play catalogs, but they’re generally geared toward improving the inventory of data or data assets known to that particular application. So a catalog that’s integrated into an analytic application might expand beyond data sets or files or things of that nature to include analytic models that are being managed within the analytic platform. Even in our Zaloni Data Platform, we provide a standalone data catalog, but it can be very well integrated with our overall platform. So if you want to just use the catalog, that’s fine, but if you want to integrate it with some of our data movement and data transformation capabilities, you can do that as well. I think as we go there’s really no right or wrong.
It’s really based on your use case, your scenario within your organization. For all of these catalogs, I think we’re going to continue to see growth in capabilities around data collaboration. How do I share this information? How do I use the wisdom of crowds to make the most relevant information more easily locatable within the catalog? We’ve also started seeing more and more requests for things like data usage reporting, impact analysis, and other types of things, and this really can tie back into what Ryan was talking about in some of the use cases from a regulatory perspective: how are people using data? We’ve had customers request reports to find how many people have requested access to data, or tried to access data that they don’t have access to, because they just want to keep an idea of who’s maybe trying to find and use information that they’re not licensed to use. So I think those are all going to be cross-catalog capabilities that continue to be provided. But ultimately we are seeing more and more that our customers want to have an actionable data catalog. They want more of a marketplace environment where not only can they use the catalog to find data and see profile information, lineage, impact analysis, and all of those things, but they can also transform or potentially provision that data into a sandbox environment.
And that sandbox could be a data lake. It could be something like Redshift, where they’re moving data into a data warehouse for some specific use or into some specific data mart. That’s really powerful, and I’ll go through some examples of why. There’s a struggle right now in the industry where there are too many separations of concerns across these different types of tools, and that can actually drive a hole through your governance process. If you have the ability from the catalog to transfer data, maybe do some lightweight manipulation, or provision this information, it can also immediately update the catalog and keep it current. You want to maintain that catalog currency so that as soon as there’s a change, or as soon as this data is moved somewhere else, it’s immediately updated and reflected from a lineage and impact analysis perspective. So this is what we’ve started to see as we build out more of our self-service data platform. It includes not only the data catalog, which is the connect part of this overall platform, but also the ability to provision and transform data, which does more of the data collection and allows you to do it in a little more intelligent way than maybe we’ve done before.
(24:00: Self Service Data Platform: Discover, Catalog and Ingest)
So as we drive down into some of the details of what we see in our catalog, or what we see our customers asking for, there are really three main areas you can break this down into. There’s discovery, catalog, and perhaps some self-service ingestion capabilities. Then there’s self-service data preparation, provisioning, and manipulation of the data. And then there is the marketplace, or the ability to collaborate and share this information. So I’ll go through some specific examples of each of these here.
From a discovery and catalog perspective, I think the most important thing as you evaluate data catalog technologies is that you want to be able to catalog as much information as possible. Certainly relational databases and data lakes are part of that. But you also want the ability to do some automated data inventory, and what I mean by automated data inventory is that you don’t want to have to process or transform the data in any way for it to be added to your catalog. You want to be able to see an S3 data lake bucket, you want to be able to see a file system, and catalog it: extract metadata from the individual files and tag those files as part of the catalog. So you want to be able to crawl and see the different types of data sources that might be out there. We were talking earlier about the need for people to be able to catalog things like emails or wikis or other types of content management systems. Whether or not that’s part of the overall cataloging process currently, we see it moving in that direction. So whether it’s via APIs or the ability to pull information from those systems and add it into the catalog, I think there’s value in that as well.
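Illustrative sketch (not from the webinar): automated data inventory, as described above, means building catalog entries from file metadata alone, without reading or transforming the data. The example below crawls a local directory with the Python standard library; the entry fields (`path`, `format`, `size_bytes`, `modified`, `tags`) are assumptions chosen for this sketch, not ZDP’s or AWS Glue’s actual schema, and an S3 bucket would need a listing API in place of `os.walk`.

```python
import os
import tempfile
from datetime import datetime, timezone


def crawl(root: str) -> list:
    """Walk a directory tree and return one catalog entry per file.

    Only file-system metadata is read; file contents are never opened.
    """
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            info = os.stat(path)
            extension = os.path.splitext(name)[1].lstrip(".").lower()
            entries.append({
                "path": os.path.relpath(path, root),
                "format": extension or "unknown",
                "size_bytes": info.st_size,
                "modified": datetime.fromtimestamp(
                    info.st_mtime, tz=timezone.utc
                ).isoformat(),
                "tags": [],  # left empty for stewards or classifiers to fill in
            })
    return entries


# Small demo against a throwaway directory with one hypothetical file.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "sales"))
    with open(os.path.join(root, "sales", "orders.csv"), "w") as fh:
        fh.write("id,total\n1,9.99\n")
    catalog = crawl(root)
```

Keeping `tags` as a separate, initially empty field reflects the idea that inventory and annotation are distinct steps: the crawl can run unattended, and tagging can follow later.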
The other thing I think you want to see from a catalog is the ability to leverage enterprise definitions and standards. Here’s what I mean by that and why it’s valuable. We work with lots of customers who have spent a lot of time, effort, and money to build out enterprise data governance solutions or enterprise metadata management solutions, where they’ve created their enterprise version of customer. They want to have a certain set of business terms that fully describes what information makes up customer, what data quality rules might need to be applied to customer-related information, and certainly other information around who can have access to this data. They’ve stored that in something like an IBM governance catalog or potentially Collibra, and they want to integrate that into their catalog so that when I browse for customer-related data, I can see these business terms associated directly with it. So one of the things that we provide is a metadata exchange framework, which allows us to incorporate information from a lot of these corporate enterprise standards into our catalog.
Another thing that we see people wanting to do is catalog their catalogs, right? From an AWS perspective, Glue is a really nice catalog capability for S3 data lake buckets and potentially even information that’s on premises. If you’re using that for certain use cases but you’re looking for broader capabilities, the ability to read information from Glue, or update information in Glue, from our catalog is something that we see customers asking for over and over again.
And then certainly for any cataloging capability, the ability to annotate and further customize attributes based on your business needs is important. As a provider of this technology, no matter what, every customer has their own requirements. They want to create an attribute that’s specific to them around line of business or data owner or whatever it might be. Allowing them the flexibility to create attributes, have them stored and saved in their catalog, and ultimately made part of the search capability of the catalog, so you can quickly search on any data that’s part of my marketing line of business, is really important. So that flexibility is something you should definitely look for.
And then something that’s growing in request, and again part of making it an actionable catalog, is the ability to empower subject matter experts not only to discover new data, but potentially to ingest it into a new environment, such as a data lake environment. We’ll talk through some of our use cases where we have customers that are cataloging Teradata-related information as well as some other third-party data services that are private to them, and they want to pull information into an S3 data lake bucket so that it ultimately can be loaded into Redshift or queried via Athena. Being able to do that from the catalog, if you have the right permissions or are the right type of user, saves them a lot of effort from an IT perspective, because they don’t need a different type of application to make that possible.
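Illustrative sketch (not from the webinar): the customer-defined attributes described above, such as line of business or data owner, can be pictured as free-form key/value pairs on each catalog entry that the search function treats as filters. The class, function, and sample entries below are hypothetical names invented for this example and do not describe ZDP’s API.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    name: str
    source: str
    # Customer-defined attributes; the keys are whatever the business needs.
    attributes: dict = field(default_factory=dict)


def search(entries, **filters):
    """Return entries whose custom attributes match every given filter."""
    return [
        entry for entry in entries
        if all(entry.attributes.get(key) == value for key, value in filters.items())
    ]


# Hypothetical catalog contents.
entries = [
    CatalogEntry("campaign_leads", "s3",
                 {"line_of_business": "marketing", "data_owner": "pat"}),
    CatalogEntry("ledger", "teradata", {"line_of_business": "finance"}),
    CatalogEntry("web_clicks", "s3", {"line_of_business": "marketing"}),
]

marketing = search(entries, line_of_business="marketing")
```

Here `marketing` holds the two entries tagged with the marketing line of business, and adding a second filter such as `data_owner="pat"` narrows the result further without any change to the catalog schema.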
Another area, as I mentioned: we’ve built a catalog, we’ve made it extensible, it’s searchable, and people can find information that’s relevant to them. They potentially have access to third-party or enterprise-level data governance standards. The next thing they may want to do is drive some self-service data preparation or provisioning capability. They may want to move this information from one environment to another. Self-service data preparation capabilities are pretty well understood in the market, and it’s become more and more of a mature capability, but being able to drive this directly from the catalog is really powerful. I was reading a Forbes article where, again, there was a CrowdFlower survey of data scientists who said they were spending somewhere between nineteen and twenty percent of their time collecting or identifying this information before they could move forward and prepare or manage it. If those capabilities aren’t available in the same interface, or aren’t part of the same overall governance process, then there can be a hole in your overall data governance strategy: you can find information, and you’re creating some new artifact from it, but are you sure that’s following the same governance processes that you would want from before? Being able to do all of this from a metadata catalog allows you to do the transformation, update the catalog immediately, and potentially set permissions or roles on who has access to this new artifact.
So it allows you to carry a lot of those data privacy concerns directly from the catalog through to any newly created data assets. The other thing that’s important to consider here is that you want to be able to scale out this capability and not have it be something that’s driven just from a single end user’s perspective. What I mean by that is you want to be able to do some of the self-service data preparation and then schedule, manage, and operationalize it based on your data infrastructure. So when we talk about some of the use case examples that Zaloni provides, we are going to be spinning up EMR clusters within AWS to do any of the data processing that needs to be performed to drive some of these data transformations, whether that’s tokenizing data, joining two data sets together, filtering out information, and so forth. You want it to scale based on your internal or external infrastructure, whether that’s on premises or in the cloud, and you want it to be governed by your data management policy. So it may be that I have the ability as a catalog user to find data assets and build a recipe to do some transformations, but it actually has to be executed by Ryan, because he’s the person who has the overall data stewardship or ownership of the data itself. We can generate the workflow or the recipe that’s going to be used, but it could be executed by somebody else. So again, having it managed by the policies that you want to run from an IT perspective is important, even if you’re enabling some of this for your business end users.
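Illustrative sketch (not from the webinar): the separation described above, where one user authors a transformation recipe but only the data steward may execute it, can be modeled as a policy check at execution time. The `STEWARDS` mapping, the `Recipe` class, and the step names are hypothetical; a real platform would submit the work to a cluster rather than return a string.

```python
from dataclasses import dataclass

# Hypothetical stewardship policy: dataset name -> the only user allowed to
# execute recipes against it.
STEWARDS = {"customers": "ryan"}


@dataclass(frozen=True)
class Recipe:
    dataset: str
    steps: tuple
    author: str


def execute(recipe: Recipe, requested_by: str) -> str:
    """Run a recipe only if the requester is the dataset's steward."""
    steward = STEWARDS.get(recipe.dataset)
    if steward is None or requested_by != steward:
        raise PermissionError(
            f"{requested_by} may not execute recipes on {recipe.dataset}"
        )
    # Placeholder for submitting the steps to the processing infrastructure.
    return f"ran {len(recipe.steps)} step(s) on {recipe.dataset} as {requested_by}"


# Scott authors the recipe; Ryan, as steward, executes it.
recipe = Recipe("customers", ("tokenize email", "filter inactive"), author="scott")
result = execute(recipe, requested_by="ryan")
```

Authoring is deliberately unrestricted in this sketch: anyone can build a `Recipe`, and the governance decision is made only when someone tries to run it.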
And then lastly, I think the ability to collaborate, share this information, and provision it out to other parts of your organization or other applications that want to use it is probably the number one request we’ve been getting from our customers. So say they query the catalog and find three data sets: one has customer data, one has sales transaction data, and one has some sort of marketing leads information. I want to take these three data sets, add them to the shopping cart, and then move them to Redshift, because that’s going to be used by some reporting application to drive a BI report or some sort of analytic process. I want to be able to simply do that from the catalog and not have to drive a lot of other IT processes to make that happen. So as you’re evaluating cataloging technologies, look for the ability to create workspaces where you can take multiple data sets, which might be from different parts of the organization or different projects, move them into a collaborative workspace, continue to annotate any of the metadata associated with these assets, and then provision them through a self-service wizard. There you can add them to a shopping cart and deliver them to a data science environment, whether that’s a data lake, a particular zone in the data lake, or in this case Redshift, which is an example we’ll use from one of our customers, or directly enable them in a data visualization tool like Tableau or any of the others that are available. Being able to say, hey, I want to pull this information directly into Tableau can be useful. And again, as this information gets pulled into Tableau, it’s updating the catalog with this new data asset; if it’s been moved into a new environment, that’s immediately searchable in the catalog.
There’s lineage to show which data set it came from within the catalog and what transformations might have been applied on top of it as it was moved into this new storage container, whether that’s Tableau, Redshift, or a data workspace. So I think that’s probably the biggest key: this self-service platform has to span your entire data lifecycle and your data management policies. You don’t want to leave any holes in there, because that opens up some new artifact that, as Ryan mentioned earlier, a bad actor could come in and maybe get access to, and you’re proliferating more data than necessary to ultimately solve the business problem at hand.
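As a rough illustration of the lineage idea, a catalog can record, for each derived asset, its parent and the transformation that produced it, and then walk those records back to the source. The asset names and the `LINEAGE` structure below are assumptions made up for this sketch, not the product’s data model.

```python
# Hypothetical lineage records: derived asset -> (parent asset, transformation applied).
LINEAGE = {
    "redshift.customer_sales": ("s3.trusted.customer_sales", "provision to Redshift"),
    "s3.trusted.customer_sales": ("s3.raw.customers", "join with sales, tokenize"),
}


def trace(asset):
    """Walk from an asset back to its original source, newest hop first."""
    hops = []
    while asset in LINEAGE:
        parent, transformation = LINEAGE[asset]
        hops.append((asset, parent, transformation))
        asset = parent
    return hops
```

Calling `trace("redshift.customer_sales")` returns one entry per hop, ending at the raw source, while an asset with no recorded parent returns an empty list.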
(35:00: Solution architecture for Amazon S3 Data Lake)
Okay, so we talked about some of the capabilities that we provide and the state of the marketplace for the self-service data platform, especially through the catalog. Now let’s look, from an architectural perspective, at what it might look like in an Amazon implementation. On the left as you look across this diagram, you’ll see things like the source databases, files, and streams of information that might be coming into your VPC. From a Zaloni Data Platform perspective, we can integrate with the AWS Directory Service, so if you’re using Active Directory or LDAP or some other means on-premise to manage users, you can reuse a lot of that capability. This is where the platform can then catalog these sources, so the databases and the files themselves. At that point, if you want to manage ingestion, you can pull the data into an S3 data lake, and it may ultimately end up in something like Redshift, but in each part of this process we are capturing and tagging the metadata and making it available within the catalog. So again, coming from the left here, you see things like landing zones, raw zones, and trusted zones. Think of these as S3 buckets that have been set up to manage your S3 data lake. As data comes in from your databases and files, we capture the metadata of where that data was and the lineage of any transformations that have been made against it, and we can apply data quality checks, tokenization, or encryption to the data. So maybe the landing zone is some sort of temporary EFS storage where we’re bringing data in, capturing only the data we need, and applying tokenization before any of that information gets moved to the raw zone. The landing zone may then be a transient thing that goes away, so you stop some of the proliferation of that data.
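The landing-to-raw step described above can be sketched in a few lines. This is an illustration under stated assumptions: the field names and key are hypothetical, and a keyed hash (HMAC) stands in for tokenization. It is a one-way substitute, so unlike a vault-based tokenization service it cannot map tokens back to the original values.

```python
import hashlib
import hmac

# Assumption: in practice the key would come from a secrets manager, never from source code.
SECRET_KEY = b"example-key-for-illustration-only"
SENSITIVE_FIELDS = {"ssn", "email"}  # hypothetical field names


def tokenize(value: str) -> str:
    """Replace a sensitive value with a keyed, deterministic token."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def landing_to_raw(record: dict) -> dict:
    """Return the copy of a landing-zone record that is allowed into the raw zone."""
    return {
        key: tokenize(str(value)) if key in SENSITIVE_FIELDS else value
        for key, value in record.items()
    }
```

Because the token is deterministic, the same input always yields the same token, so two tokenized data sets can still be joined on the protected field without exposing it.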
Then you have your catalog users, who can now see the information in the raw zone if they have access to it, or who can see that there is data in the raw zone they don’t have access to and request access to it. And as we make more transformations to the data, from raw to trusted to refined, we can spin up an Amazon EMR cluster and deploy Spark or MapReduce capabilities to it. It can be scaled based on your needs. We have customers who might burst up their EMR processing for a certain period during the day but keep only a few nodes in the cluster resident all the time for their execution capabilities. Ultimately, the data that is stored and processed in the S3 data lake can immediately become queryable from things like Amazon Athena or potentially the Amazon Elasticsearch Service. You can also use our provisioning capability, as I talked about before, where our marketplace lets you see data that’s in these S3 data lake buckets and provision it into something like Redshift or Amazon QuickSight, or for immediate use in things like Tableau or Kibana. So that’s an overall view of how we manage this flow of data within Amazon, and what I really like about the architecture is that it’s completely flexible. If you’re just going to use this for cataloging information as you move it into an S3 data lake, that’s one thing. If you want to use us for more data transformation, data quality, tokenization, and encryption of information, we can integrate directly with Amazon EMR to provide the compute layer that’s needed to deliver this, and then the data is immediately available in any of the ways you may want to consume it from the various Amazon services.
Let’s take a quick look at some of the customers that are currently using the Zaloni platform to help manage their data catalog and their self-service data platform in an AWS environment.
This first one is an insurance company based in the US, and their whole goal was really to create more personalized insurance offerings, basically driving a customer 360 initiative in the cloud. So they’re taking data from their sources: some of this comes from various AdWords integrations with their Adobe Omniture infrastructure, and they’re pulling data in from internal systems as well. On a daily or an hourly basis, they’re pulling information through a pipeline and driving that to a 360-degree view of customer information that then feeds several of their analytic applications. The slide says here that in this solution storage is Amazon S3 and publishing is Amazon Redshift, but that’s actually a mistake. In this case, the publishing is going to an Elasticsearch environment, where they’re using Kibana and a couple of other capabilities to drive some of the reporting they wanted to deliver. So they were doing things like ingesting information, cataloging certainly was a big part of it, and they used a lot of the workflow automation that we provide to pull the source information through and create updated views of the customer on an ad hoc or an hourly basis. If you look at the solution architecture, it is actually a little bit different from the more common architecture I showed earlier. In this case, they were using our platform and AWS, primarily S3, as the storage infrastructure, but they were also using EC2 to run the HDFS components. So they were building out a Hadoop data lake.
So if you look from left to right here, there are various data services and cloud-based marketing information that they were bringing in, and there were internal CRM and ODS applications coming in as well. Again, there was this sort of polling, event-based architecture that pulls data from these various systems on a timed basis or an event basis and drives it through an FTP server, where ultimately our platform would pull this information in and catalog it. Down below, there was the whole Adobe infrastructure, and they were driving some identity management capabilities via Adobe that also fed into this customer 360 implementation. Our software in this case was deployed on EC2 instances in Amazon, and the cluster compute was being driven by an EC2-based cluster running the Hortonworks Data Platform. We had set up various zones for raw, refined, and trusted information, but we had a transient landing zone on Amazon EFS. That was so that once the information comes in and certain processes are run to obfuscate, encrypt, or secure it, the raw data would get moved and dropped from EFS, so that we weren’t proliferating it through the rest of the platform. They also had this really cool archival process where they were pulling data from the HDFS environment into an S3 data lake from an archival perspective. The reason they did that, and they had this set up as an event-driven or time-driven process, was that the information was only valid for certain periods of time. So it may be that some of the information coming from the cloud or from their CRM and ODS initiatives was only available for 30 days, or maybe six months, and I think the maximum was five years, and they wanted to be able to track the overall lifecycle of the data and the timeliness of the information.
They wanted to archive it off of their core data platform into an archival solution, an S3 data lake in this particular case. The other thing that was kind of interesting in this overall platform was that they used the self-service interface for querying the catalog to publish data into a few different environments. So there was the serving layer store, which was Elasticsearch, where they were creating these customer 360 dashboards in Kibana, but they were also moving that information into some other applications, and I think Redshift was being used here as well. So this is a little bit different. I would say this is more the collect architecture than just the connect architecture. What was critical for them was to be able to get all of this information from these source systems, catalog it, and make it available as part of this overall customer 360 use case.
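The retention-driven archival process mentioned for this customer comes down to a date comparison per data set. The sketch below is only illustrative: the source names and the window assigned to each one are assumptions, chosen from the 30-day to five-year range mentioned in the talk.

```python
from datetime import date, timedelta

# Hypothetical retention windows per source; the talk only gives a range of 30 days to five years.
RETENTION = {
    "cloud_marketing": timedelta(days=30),
    "crm": timedelta(days=182),
    "ods": timedelta(days=5 * 365),
}


def due_for_archive(source: str, loaded_on: date, today: date) -> bool:
    """True once a data set has outlived the retention window for its source."""
    return today - loaded_on > RETENTION[source]
```

A scheduled or event-driven job could run a check like this over the catalog and move anything that is due from the core platform to the archival store.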
And then quickly, because I know we’re running late on time here before we get to questions, the other use case was from a pharmaceutical company, also US-based, where they were trying to reduce their overall time to insight for their analytics users and business users, and they wanted them to have a more collaborative environment. So they were essentially cataloging data from lots of different applications, both on-premise and in the cloud, and they wanted a more governed, self-service way to control who could see which assets were in the catalog and, ultimately, who could provision information into Amazon Redshift. They were heavy users of Redshift in this particular case. Their architecture was very similar to the one I originally showed, with the Zaloni Data Platform cataloging data that was on-premise, so their CRM and other applications, SAP and so forth. The ZDP platform was running in EC2, where we’re always running a persistent EC2 instance, and we used the EFS shared file system for things like our log files and other information that we’re sharing. As you’ll see where the line points down here, we were also using an Amazon RDS MySQL instance as well as an Elasticsearch instance to store our metadata and support search. What’s really cool about this is that the EMR instance for any of the compute we were doing was totally transient, so we can spin it up and have it go away. We’re saving the metadata of the information, and the data itself is ultimately stored in an S3 data lake or other storage. So people can still query our catalog and see the information that’s available, both on-premise and in their cloud-based environments, whether that’s Redshift or the data lake, and then we can do the compute in a completely elastic and transient way.
So this is a very powerful solution that we’re looking to replicate for many of our customers who are going more all-in on the AWS solution architecture.
And just to wrap up very quickly, our platform is really an integrated self-service platform. We’re trying to accelerate time to insight and reduce a lot of the complexity of managing these difficult cloud and on-premise environments, and we’re trying to drive this whole concept of enable, govern, and engage throughout our entire platform. So how do you catalog and ingest information? How do you govern it? And then ultimately, how do you engage the business to have access to it in a way that helps them do their job more efficiently?
And with that, I will turn it back over to Ryan.
[Ryan] Thank you very much for your time, and thank you for that very insightful information. So Ganjan has asked a question: AWS Glue does data cataloging. How does Zaloni differ?
[Scott Gidley] Yeah, so I’ll take that question. We’ve been asked that question before. So Glue is a relatively new cataloging capability from AWS, and it does a fine job for information that’s stored in an S3 data lake or other AWS applications. We have a little more capability to, first and foremost, drive some more annotation and customization of the information that’s stored in the catalog, so we can enhance some of the information that’s stored in Glue. We also have the ability, as I mentioned, to do some of the provisioning in a more self-service way than perhaps Glue provides. As we look to move forward, we see our catalog as something that can consume other catalogs or publish our information into other catalogs, for somebody who wants to expand their Glue usage as well.
[Ryan] We saw a few different solution architectures. Is Zaloni selling the product license, consultancy services, or all of the above?
[Scott] Zaloni sells a product license. There are certainly some enablement capabilities that we can provide to help build out your infrastructure if needed, but we primarily sell the product, and customers implement it based on their needs. One area where we do work closely with our customers is when they want to add a new source or a new connection that maybe we don’t provide out of the box. We can create custom connections, or, let’s say it’s a file that we don’t support natively and that you really want to get more metadata from, we can create custom connectors or custom parsers for that data. That is something we partner with our customers to build and then make available in our product after the fact, and in some cases our customers can build those on their own.
[Ryan] Do you support hierarchical data cataloging?
[Scott] So you can create hierarchies within the data that’s stored. Let’s say you wanted to store something like a hierarchy of the organizational structure where data might reside, so it’s under the marketing / retail / business hierarchy. You can create that level of hierarchy within our metadata attributes. What we don’t allow you to do is drag and drop to create a hierarchy of relationships between entities across different environments. We don’t have that capability just yet.
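One simple way to picture a hierarchy stored in a metadata attribute is as a slash-separated path that can be matched by prefix. This is a generic sketch, not a description of how the product stores attributes; the `under` function and the example paths are hypothetical.

```python
def under(path: str, prefix: str) -> bool:
    """True when a hierarchy path equals the prefix or sits below it."""
    parts = path.strip("/").split("/")
    wanted = prefix.strip("/").split("/")
    # Compare whole path segments so "retailers" does not match "retail".
    return parts[: len(wanted)] == wanted
```

With this, an entity tagged `marketing/retail/business` is found by a search for `marketing/retail`, while one tagged `marketing/retailers` is not.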
[Ryan] Can you automatically remove permissions from users after you think they’re done with it?
[Scott] Yeah, we could. I guess the question would be how we determine whether the user is done with it. We associate users and role-based access controls with the entities in the catalog, and those can be controlled or updated programmatically via a lot of different things. So let’s say there’s a time-based event after which we don’t want users to have access to information in the catalog. We could remove them from a project as part of an automated workflow that says, for example, after 60 days this group of users no longer has access to this particular entity. We could definitely do those types of things, if that’s what the question is.
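The time-based rule in that answer could look something like the following sketch. The grant structure and function name are hypothetical, and only the 60-day figure comes from the answer; a real workflow would call whatever access-control interface the platform exposes rather than edit a dictionary.

```python
from datetime import date, timedelta

ACCESS_WINDOW = timedelta(days=60)  # the 60-day figure is the example from the answer above


def revoke_expired(grants: dict, today: date) -> list:
    """Remove grants older than ACCESS_WINDOW and return the keys that were revoked.

    `grants` maps a hypothetical (group, entity) pair to the date access was granted.
    """
    expired = [key for key, granted in grants.items() if today - granted > ACCESS_WINDOW]
    for key in expired:
        del grants[key]
    return expired
```

Run on a schedule, a step like this removes a group from an entity once its window has passed and returns what it revoked, so the change can be logged or reviewed.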
[Ryan] I know you have this concept of the golden record. I’d be interested to understand how you deal with various data sets, various databases, and various file assets and know that you’re dealing with the same consumer record across many different elements of data.
[Scott] Yeah. So the way we enable the golden record is via a capability we have called the Data Master extension. It’s a Spark machine-learning-driven data matching, sort of entity resolution type of solution. What we let you do is train the models based on your live data to get a level of matching accuracy for whatever data type it might be. Then, when we deploy it and execute it on multiple data sets, under the covers we’re using the power of a data lake to bring these various data sets together and harmonize them in some ways, so that we can apply the matching across all of them and identify the single record that’s going to identify them across the various clusters. There’s not a ton of magic there as far as how that data is brought together in the data lake; that’s what we use our platform to do. So it’s not a decentralized way to do it, as you mentioned as part of the IPC.
[Ryan] How do you go about pricing Zaloni?
[Scott] Sure. So there are a couple of different mechanisms for pricing. For delivering purely in an AWS environment, there’s a core cost of the platform, which is fixed, and then if you’re using the elastic capability for processing data and you’re using us for any level of compute, there’s a compute-hour charge associated with it. So think of it this way: to get the platform up and running and populate your catalog initially, there’s a fixed cost, and that could be tiered based on how much data you’re going to catalog and so forth. Then, if you’re using us for provisioning or transformation of the data, there will be a compute-hour charge. We have basic enablement services for customers, which generally take four to eight weeks to get you up and running in a production-level environment. Some customers may have specific types of implementations, and we have different offerings for those. For instance, we have one called Data Lake in a Box, where, based on the number of sources and the number of use cases, we will get your implementation up and running and hand it over to you within a certain environment. This is for a lot of our customers who are trying to dip their toe in the data lake pool and don’t want to stub their toe or have any of the headaches of the folks who went through the initial implementations. They want us to help them go from start to finish for a certain use case or a certain set of data sets, then hand it over and let them manage it from there moving forward. They’re all tiered, so if you’re interested in that, just reach out to us and we can go through any of the different options we have, from not only a licensing but also an implementation perspective.
So I think certainly financial services is an area where we’ve made a lot of headway from a customer perspective. We also have some accelerators from a file connectivity and management perspective for things like healthcare, whether it’s HL7 or some of the other EDI formats that we support natively, and then also from a telco perspective. So I think telco, financial services, and healthcare are the three areas where we’ve cut our teeth.