May 18th, 2016
In Ovum’s upcoming Big Data Trends to Watch 2016 report, Tony Baer forecasts that data lake management will become a front-burner issue as early Hadoop adopters get to the point of production implementation.
During this fireside chat, Tony Baer and Scott Gidley, VP of Product Management at Zaloni, assess the state of the industry regarding the governance and data management tools, technologies, and practices that should fall into place as part of a data lake strategy.
Some of these trends are still accurate today. Are you ready to build a data lake governance platform for better, faster analytics? See how Zaloni’s DataOps platform, Arena, can help in your custom demo!
Read the webinar transcript here:
[Kelly] All right. Good afternoon, everybody. Thank you so much for joining our webcast today on governing the data lake. My name is Kelly Shuph, Vice President of Marketing here at Zaloni, and I’m pleased to be here today with a special guest, Tony Baer. Tony is the principal analyst for Information Management at Ovum Research. What we’re planning to do today is more of a discussion format rather than just presenting a set of slides; Scott Gidley, Zaloni’s Vice President of Product, has joined us as well, and he’s going to talk with Tony about data lake governance. We’re expecting to go for about 30 minutes, and we’d also like you to feel free to participate. If you have any questions you’d like us to address within the flow of the conversation, go ahead and post them in the Q&A chat box as we go; there’s no need to wait until the end. We’ll do our best to get to them within the 30 minutes, and then we’ll do a little Q&A afterward if there are still questions remaining. If you’re having any issues viewing the charts, you should see one up right now that says “Meet today’s speakers”; if you’re having trouble, post a note and Annie Bishop on our team will help you work through it. Before we get there, though, I need to do my 30-second advertisement, so bear with me: I want to give you a really quick introduction to who we are and why we’re bringing this topic to the table. Zaloni provides a data lake management platform for enterprises that are transitioning to these next-generation data architectures, and we work with enterprises to build production data lake implementations that are organized, governed, and actionable. Our flagship product is called Bedrock, a fully integrated data lake management platform for big data. You can also see Mica on this chart; Mica is a tool for self-service data discovery, curation, and governance of the data lake. What Mica, alongside Bedrock, helps organizations do is provide more, but also more controlled, access to the data for their business end users. Now I’d like to move on to the topic at hand and the reason y’all are here. Tony, I’m going to start by handing over to you; perhaps you could give us a quick introduction to your research report.
[Tony Baer] Okay, thank you, Kelly, and I’d like to thank all of you for taking time out of your day to join us in this discussion. It’s an area where we’ve seen a sharp uptick in inquiries here at Ovum just this year alone on data lakes, which tells me that this topic is really starting to enter the agenda; we’ll go into that a little more in a moment. My background is that I lead the big data and data management research area at Ovum, and about five years ago I saw something very interesting happening: a major tectonic shift in architectures, where we were dealing with scale in a very different way. In the early internet era we dealt with scale by trying to abstract all the logic into a middle tier or n-tier, and we basically used the application to cache data to make up for all the I/O problems with databases. What happened in the meantime was that pioneers like Google and the other big internet companies found that to produce better ad placement or improved search indexes, or what have you, they needed to go through what would be considered obscene amounts of data, and they had to develop new strategies for dealing with that. What we’ve found over the past five years is that this is not only a problem for the Googles of the world; it’s a problem for enterprises that are trying to do things like improve how they act on, say, customer 360, deal with security issues or risk mitigation, or even just take advantage of some of the new commodity scale-out technology to lower the cost of some very common computing tasks. So in our report, what we really found this year is that the data lake is starting to enter the agenda; we’ll go into that a few slides down from here. On this page are the conclusions, because we started with the conclusions and then we’ll build up to them. The first one should not be surprising: data lakes should be managed. There has been a lot of discussion about this; the issue is not new, and there have been a lot of terms used for what happens if you don’t manage the data lake, where you end up just dumping data in without any concern for what’s going in there, the quality of it, or who gets access. The fact is that in the early days of big data adoption, and adoption of platforms like Hadoop, it wasn’t such a big concern when we were just ingesting things such as log files. But as the use cases for Hadoop and for big data have evolved, we’re now seeing a much greater variety of data, including data that comes from sources we would consider sensitive. So our conclusion was that, for a number of reasons we’ll go into further, if you have a data lake you had better manage it: you had better know what’s in there, you had better have control of the data that’s going in there, and you had better have control over the process for turning that data into information.
That’s the hundred-thousand-foot level; we’ll be discussing with Scott in a lot more detail what that really means. The next point is that data lakes must have the capability to ingest all data and related metadata. This is a technical definition of what a data lake should do: in very simple terms, it can ingest data that would be a superset of what you traditionally would have put into a data warehouse or data mart. Ideally, the data lake should be able to act as a common repository for all forms of data that your organization may consider relevant. That leads to the next point, which is that data lakes will only succeed if they become shared resources. That reflects the idea that if you’re going to expand the use case of your big data platform to become a data lake, it really becomes an enterprise system, and therefore the investment of time and resources will only be worthwhile if you can expand the use of this platform beyond, let’s say, your classic power users, or the data scientists or R and Python programmers on your staff if you’re lucky enough to have them. It doesn’t need to be accessible to everyone, but it needs to be accessible to a broad cross section of business end users as well as the more traditional audience of data scientists. The next point, and this was a very interesting one that Scott and I will discuss in much further detail, is that there’s a basic change in governance when it comes to data lakes compared to data warehouses. With data warehouses, IT and the technology folks were really in charge of making the data suitable for consumption. That purely IT function is just not going to work in a data lake, because the task is too vast and too changing. So just as the data lake must become a shared resource, the governance has to become shared as well: users must be prepared to take responsibility for curating data, and there may be some challenges getting there, which we’ll talk about some more. Then, finally, the last couple of points. Obviously, since data lakes are relatively new, the maturity and readiness of the tools, the technologies, and the best practices are still forming. And from that the last point follows, which is that the management and governance of data lakes should be considered a phased process; Rome was not built in a day. But anyway, having gone through all that, Scott, I’d like to hand it over to you so you can put in your two cents. What’s your take on all this?
[Scott Gidley] Thanks. I think in general we’re in violent agreement on most of the key findings here. When we speak with our customers, we see that they have a lot of challenges around the overall governance and management of their data lake and their data lake strategies. A couple of points I picked up on from the findings: first, regarding the ingest of all data and the related metadata, you characterized it correctly; it’s a superset of the data that you would have had in the traditional data warehouse. There may already have been governance policies there, and it’s pretty easy to capture metadata in traditional data warehouses, which are more relational in nature; that’s something that’s been tried and true over the last 15 or 20 years. But there’s a lot of non-relational information in the data lake that you’re trying to correlate or coalesce with that information, and there may be no easy way to access or gain the metadata you need so that you can govern all of the data. Otherwise you get these dark spots within your data lake, where there’s some data you have metadata about and other data where you have no idea what the metadata is, or whether it even exists. So our approach is to capture metadata on all data upon ingest, and that could be business metadata, technical metadata about what type of file it is, as well as operational metadata: how often it gets updated, who has access to it, and so forth. Building these rich metadata catalogs is really the first step in governing the overall data lake. The other thing that really rings true is the readiness of the tools, the technologies, and best practices. In the open source market, if you’re building your data lake on top of Hadoop, the tooling is loosely coupled at best, and a lot of it is not yet mature; as it matures, we need to think about how it fits into the overall governance process. And I think the phased approach makes a lot of sense. We’ll talk about this later on one of my slides when we look at an overall architecture for a data lake: we see a lot of our customers breaking it up by zones and applying governance policies to the various zones. So maybe initially there’s a raw zone where all the data goes and only a select few folks within the IT group have access to it, but as data gets moved from, say, the raw zone into a refined zone, additional policies are placed on top of it, and that allows you to manage it in a more phased approach. Now we can go to the next slide.
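As a rough illustration of the metadata capture Scott describes, here is a minimal Python sketch of the kind of catalog entry an ingest pipeline might record for each data set. The field names and the register_on_ingest helper are illustrative assumptions, not Zaloni Bedrock’s actual API.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class DatasetMetadata:
    """Illustrative catalog entry captured when a data set lands in the lake."""
    name: str                     # business metadata: what users call it
    description: str              # business metadata: what it contains
    source_system: str            # technical metadata: where it came from
    file_format: str              # technical metadata: e.g. csv, json, avro
    zone: str = "raw"             # which zone of the lake it currently lives in
    owner: str = "unknown"        # operational metadata: who is responsible
    access_groups: list = field(default_factory=list)          # who may read it
    ingested_at: datetime = field(default_factory=datetime.utcnow)
    update_frequency: str = "ad hoc"                            # operational metadata

# Hypothetical in-memory catalog; in practice this would be a persistent metadata store.
CATALOG = {}

def register_on_ingest(entry: DatasetMetadata) -> None:
    """Record metadata for every data set at ingest so the lake has no 'dark spots'."""
    CATALOG[entry.name] = entry

register_on_ingest(DatasetMetadata(
    name="web_clickstream_2016_05",
    description="Raw clickstream logs from the web tier",
    source_system="nginx access logs",
    file_format="json",
    owner="it-ingest-team",
    access_groups=["data-engineering"],
))
```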
[Tony Baer] Actually, one point I meant to add there: Scott, you raised a very interesting point, which is that the knowledge about the data that goes in there is all over the map. Maybe we can borrow those terms that were used earlier on, like known knowns, known unknowns, unknown unknowns, and so forth; that is going to be a challenge when you widen out the scope of data that you ingest. Anyway, we have a couple of slides here that talk about the stages of where the data lake fits into the overall, I don’t want to call it a maturity model, but a kind of maturity cycle of Hadoop adoption. What we see is that it should not be the first thing, and that comes back to the very definition of a data lake, which is the point at which this data repository becomes an enterprise resource. You don’t start with an enterprise system on day one; you build it up by pieces, based on your knowledge and your experience with each of these use cases. Typically, what’s happened with Hadoop adoption is not that different from a lot of other enterprise systems: it starts very often as a group effort and then gradually spreads to where some other departments, some other line-of-business units, start to take advantage of it, and from there it matures or evolves to the enterprise. As for the types of workloads we’ve seen at those stages, it may start with something as basic as log analytics or sentiment analysis; those are some of the earliest uses we saw come in at firms. What’s really interesting is that on the enterprise side, adoption often took very different paths. It started with offloading of cycles, whether compute cycles or storage, from the data warehouse. One of the earliest and most common patterns I’ve seen is ETL offload, because the idea is: why expand your Oracle or HANA or Teradata footprint by performing ETL there? For a lot of organizations, doing that data warehouse offload for things like ETL, or for certain types of very long-running batch queries, has been a fairly common pattern of adoption, and it makes a lot of sense, because it’s a relatively low-risk way of getting your feet wet, getting experience with this new platform, and using it for use cases where there is a tangible ROI. But then, as organizations start to get success from some of those earliest implementations, such as “we reduced the growth of our Teradata footprint by, say, 50% over the past year because we offloaded cycles and storage from it,”
other departments will start to get interested. That’s where you begin to see workloads like exploratory analytics, which initially will probably be led by your more highly skilled or more sophisticated users; those are the folks who will be running the R or Python programs. It’s basically a process of searching for the right question to ask, searching for the signals and where the signals are leading you. Then, when it gets to multiple departments, you’ll start to see some line-of-business analytic applications, and that’s where end users start to be folded in, and then operational analytics, especially as we start incorporating some of the high-performance processing engines such as Spark, where analytics can be employed in closed-loop fashion with the processing of transactions. The ultimate stage of adoption, the endgame here, becomes the data lake, where you say, okay, we’ve now gotten successful take-up from a handful of line organizations; let’s see if we can broaden this to be an enterprise resource. Can we go to the next slide?
[Scott Gidley] Tony, I did want to make one quick comment before we move on. I think this slide really explains how adoption expands within an organization, and I thought I’d throw in a couple of actual customer examples. One of our bigger customers started with the need for offload and data warehouse modernization. They weren’t so much focused on reducing their Teradata or Oracle footprint, whatever the warehouse might be; they were trying to create new and interesting analytic data sets that couldn’t easily be created or managed in a more relational world. They could use the data lake to mash up data from various unstructured formats and create new analytic data sets that they could then put into the warehouse and use for that purpose. From the offload perspective, they found they were using too much of their Teradata instance, upwards of 90%, for ELT types of processing, so by doing more of that offloading in the data lake at a much cheaper cost, they could use their warehouse instance for its main purpose. I think there’s some real value there. Then, once that was a successful project, they expanded what they ended up calling their data lake into the concept of a metadata catalog that anyone in the organization could search to see what new data sets might be available, and they could make a request to have those provisioned into some other type of data store that they could consume for their own applications. But I think the number one thing that prevents this maturity curve from happening is governance. For the majority of the customers we start with, their data lake or their Hadoop instance is stuck in that sandbox or discovery environment where very few people have access to it, and they’re worried about the governance processes needed to open it up to much larger enterprise use. So I think, and I know we’ll talk about this later, as you move up that experience curve to where the data lake is a 360-degree view of all of your enterprise data, governance is a gating factor for them.
[Tony Baer] No doubt about that. Maybe a good example of this is the evolution of the internet and of email. When email was first invented, the internet was used only by the research community and everybody knew each other, so nobody ever thought to put into those original email standards any prevention of people spoofing their IP addresses or email addresses. In a small group, you didn’t need that governance. But then the internet became a mass consumer platform and, oops, we forgot that. So governance is going to be essential when you broaden use, because the rules of engagement that could be considered implicit for early users will need to be made explicit when you broaden that user population. So, Scott, I couldn’t agree more with that. Can we go to the next slide?
[Scott Gidley] As you see it, for the organizations you’re speaking with, where are they on the maturity curve from a pure data lake perspective? Are they still trying to figure out how to move from that sandbox environment, are you seeing them start to cross the chasm to something more mature, or is it all over the map? I’m just curious.
[Tony Baer] I would say, with very few exceptions, it’s really still figuring out how to scale up and scale out from the sandbox. Yes, you have maybe a handful of early adopters that have gotten to the point where they’ve said, oh, we’ve figured this out; I could probably count those on the fingers of one hand and still have a few left over. Most organizations are still at the point of figuring out what role a data lake is going to play, and what its role is alongside the data warehouse, which, by the way, is not going away; the data lake does not replace it. So we’re still at very early stages on this, and in fact it feels like folks like you over at Zaloni have a potentially very important role to play here in making this navigation of governance a lot less mysterious, a lot less of a black box.
Can we go to the next slide? Okay. You’re going to be seeing a couple of different takes on reference architectures over the next few slides; this is our take on a reference architecture from a governance standpoint. We’ve color coded it: the areas in pink, set at a right angle, are capabilities that you would have on any data platform, because there are some things you’re going to need regardless of whether you’re using your repository or your platform as a data lake or not. That would include, for instance, perimeter security. Even if it’s not used as a data lake, you want to make sure there’s protection on the outside against bad people coming in and doing bad things, or the wrong people coming in. Then of course, as with any large data platform, you’re going to need to monitor and control, so you’ll have your systems console there to make sure the thing is operating properly. And there’s the housekeeping of availability and reliability, which could include fault tolerance, high availability, backup, and disaster recovery. Those were not necessarily considerations early on with Hadoop, when there were single use cases, but even before you get to a data lake, now that we’re starting to run more mission-critical processes, those become factors you can’t ignore. You need to have those capabilities and disciplines in place, perimeter security, monitoring, availability, and reliability, before you even think about putting up a data lake. Then, cutting to the chase: what is the data lake? Well, you have the data platform itself; that’s obviously the lowest level. Above that, and this is an area where we’re still at very early days, is the whole idea of cost optimization. Frankly, when you start turning this into a shared resource, you have to start prioritizing things; not everybody is going to be able to get priority-one access, because there’s going to be a lot of contention, especially given the variety of workloads between real-time, let’s say streaming, workloads and batch, and they’re going to be coexisting side by side. Modeling itself is not a real-time workload; it never has been and never will be. On the other hand, there’s figuring out how to apply those models. A good example is real-time credit scoring, where you go to a site, fill in some basic demographic information about yourself and some basic financial information, and all of a sudden you get, okay, this is how much credit we’re going to give you. It’s not that they figured that out on the spot; they’re matching your profile against a bunch of precomputed profiles that were scored offline. You’ll have the same thing in the data lake. Integration is also very important,
and we include that in the same layer as optimization, not because they have much to do with each other, but because we didn’t want to make this diagram overly complex. It’s another system-level capability, because a data lake is not going to exist on its own island; it’s going to exist alongside other platforms. It might be accepting data from them, for instance an offload from the data warehouse, or you might be refining data in the data lake and then need to move some of that refined data to the data warehouse for production analytics. So you’re going to have integration with all the other data platforms that exist in your environment. Atop that is data-level security, which complements the perimeter security you should already have. Perimeter security is about keeping unauthorized users out; data-level security is about figuring out what type of access and what type of protection we provide to this data: do we display this data as is, or do we need to mask it, encrypt it, or hash it in some way, shape, or form, and for which classes of users? So data-level security is also a very important function. Then at the top is where we start getting toward what we call the self-service tier. The next level up, the one we’re going to focus on, is what we call data inventory. That includes a couple of processes, which we’ll go into in more detail later on. One is the physical inventory, where we’re trying to figure out what data is in there and what measures we apply with regard to data retention, security, access, and so on. Data curation is where end users are essentially curating their own information: these are the data sets we need. Those are the areas we’re going to focus on very heavily in this talk. Above that is the application layer, where you’re doing the querying, the analytics tools, the analytics programs. As I said, we’ve grouped that toward the top, toward what we call the self-service tier, because unlike a data warehouse, where you have a curated, walled garden of data and a finite range of queries you can run against it, in a data lake all the boundaries are off, so self-service is going to be essential for a number of reasons. Scott, what’s your take on this?
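To make the data-level security discussion concrete, here is a minimal sketch of field-level masking and hashing per class of user. The policy table, roles, and field names are assumptions for illustration and do not reflect any specific product’s API.

```python
import hashlib

# Hypothetical policy: how each class of user sees each sensitive field.
# "clear" = as is, "mask" = partially hidden, "hash" = one-way tokenized.
FIELD_POLICY = {
    "ssn":   {"data_scientist": "hash", "analyst": "mask", "admin": "clear"},
    "email": {"data_scientist": "hash", "analyst": "clear", "admin": "clear"},
}

def protect(field_name: str, value: str, role: str) -> str:
    """Apply the data-level security policy for one field and one user role."""
    action = FIELD_POLICY.get(field_name, {}).get(role, "hash")  # default to the safest option
    if action == "clear":
        return value
    if action == "mask":
        return "*" * max(len(value) - 4, 0) + value[-4:]  # show only the last 4 characters
    return hashlib.sha256(value.encode("utf-8")).hexdigest()  # irreversible token

record = {"ssn": "123-45-6789", "email": "jane@example.com"}
print({k: protect(k, v, role="analyst") for k, v in record.items()})
# -> ssn masked to '*******6789'; email shown in the clear for analysts
```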
[Scott Gidley] Yeah, we see the same types of things, and as we transition to the next slide we’ll go through how we fit into that architecture, or how we see organizations fitting into it. When I see our customers starting out on their data lake journey, the first part of it is what you call the data platform: how are they building and managing the rate of change within the Hadoop ecosystem and the skills gap that it creates? This has gotten much better over the last couple of years; the Hadoop distributions have matured and are including more tooling for monitoring and upkeep of the clusters themselves. But there’s still the skills gap. Traditionally, data warehousing and BI tooling was very SQL-driven and application-driven, and now folks have to learn things like MapReduce and Spark and Oozie and the whole alphabet soup of the open source landscape. That skills gap creates these unicorn individuals who are hard to find and even harder to keep in an organization, so people are looking for tooling to smooth out some of this process. We also see, and I’d be interested to get your feedback here, Tony, that the complexity of environments that are mixed between cloud and on-premise, that hybrid environment for the data lake, is also causing some struggles for our customers. But that’s sort of managing and building the platform, which would be the bottom couple of layers of your architecture, right?
[Tony Baer] Right, and Scott, this is not confined to data lakes. The whole rationalization of workloads, and when and where you deploy data to cloud versus on-premise and making that seamless, is a very general issue, which of course data lakes will have as well. Going back to the previous diagram, that would be an issue in that pink area, because it’s a very general issue, and you’re only just starting to see features come in from the database folks. For instance, Microsoft SQL Server 2016 has a Stretch capability, and Oracle has its capability for going from Oracle on-premise to the Oracle public cloud, and so on. Those are features you’re starting to see implemented within vendor stacks, but generically, making this transition to a more heterogeneous environment is going to be quite a challenge, and it’s a challenge that’s going to be with us for a while.
[Scott Gidley] Yeah, and I would just echo that. From the data lake-specific perspective, in the last quarter I would say well more than 50% of our customers or prospects have been in either pure cloud or hybrid cloud environments, so it’s something that needs to be addressed from a data lake perspective sooner rather than later.
[Tony Baer] Yeah. One question I have for you folks: given that over half are dealing with cloud in some way, shape, or form, does it seem to be trending toward hybrid, or toward “this data lake is so complex, we’re going to put the whole thing in the cloud”?
[Scott Gidley] I think it depends. I’ve seen more hybrid than full cloud, but the ones that tend to be full cloud are probably the younger companies that have less on-premise infrastructure to begin with; they’re starting everything from a cloud perspective. So, looking beyond the building of the data lake and the infrastructure components, which seem to be stabilizing, the next step is managed ingestion: how do we get data in, and how do we capture rich metadata about the information so that we have clear visibility and can address the privacy and compliance issues that keep these from becoming production-level instances? That, I think, is also getting better, and it’s certainly an area where Zaloni spends a lot of focus. As you build this managed data lake with a rich metadata catalog, you can start to deliver on the value of enabling high-quality information and reusable processes. On your previous slide you mentioned the whole concept of how we operationalize this as workloads change and as you enable it for more folks: being able to operationalize the updating of data, the delta changes within the ecosystem. These are all challenges that are different from the way things work in the traditional data warehousing space. As we look at the different types of solutions out there, you want to enable as much data as possible to be ingested and organized into catalogs within the data lake, and you want to be able to govern and cleanse that information. Having spent a lot of my past lives in the data quality arena, I always think it’s déjà vu all over again: we worry so much about ingest and data architecture, how we’re going to get data in and how we’re going to access it, and the concepts of governance, cleansing, and quality of information are always second tier, but they’re bubbling up again here as well. And then lastly, how can we engage the business? We want to enable this access and engagement with the data: data analysts, data stewards, business users, and data scientists should all be able to collaborate. They should be able to discover the data that’s in the lake without having to know how to query HCatalog or the Hive server, and then they should be able to enrich that information and provision it back out to other analytic applications, other operational applications, whatever it might be. That’s really where I think we still have a ways to go in the whole data lake journey.
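As a sketch of the business-facing discovery Scott mentions, searching a catalog rather than querying HCatalog or Hive directly, the snippet below uses a hypothetical in-memory catalog; the entries, group names, and search_catalog helper are assumptions for illustration only.

```python
# Hypothetical catalog entries; in practice these would come from the metadata store.
CATALOG = {
    "web_clickstream_2016_05": {
        "description": "Raw clickstream logs from the web tier",
        "zone": "raw",
        "access_groups": ["data-engineering"],
    },
    "customer_360_refined": {
        "description": "Curated customer profile data set",
        "zone": "trusted",
        "access_groups": ["data-engineering", "marketing-analysts"],
    },
}

def search_catalog(keyword: str, user_groups: set) -> list:
    """Let a business user discover data sets by keyword, filtered to what they may access."""
    keyword = keyword.lower()
    return [
        name for name, meta in CATALOG.items()
        if keyword in (name + " " + meta["description"]).lower()
        and user_groups & set(meta["access_groups"])
    ]

# A matching data set could then be requested for provisioning out to another
# store (a warehouse table, an export bucket, etc.) rather than queried in place.
print(search_catalog("customer", user_groups={"marketing-analysts"}))
```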
[Tony Baer] No question about that. I just want to seize on one of your points here, and it’s going to segue to a point you make in upcoming slides where you lay out your own reference architecture, which is the whole notion of data quality and cleansing. In a data warehouse it was relatively cut and dried, at least conceptually: the data was either dirty or it was cleansed to a specific canonical form. A data lake is going to be a much different creature. We did some research, coming up on four or five years ago now, looking at how data cleansing and quality are going to apply in the world of big data, and our conclusion was that this was going to have to evolve. It’s not that the old approaches were wrong; it’s really more a set of concentric circles. There’s some data you’re going to be using for exploratory analytics, where you’re trying to define the problem and you’re not making any operational or auditable decisions yet; you’re just trying to get a sense of where you should be looking. At that point, quality is still important, but you don’t need the fine level of precision that you would need in a data warehouse, where you are making real decisions that affect real business processes and may in some cases have, let’s say, regulatory compliance implications. You hit on this in your slides: there are going to be different gradations of data. We had another way of putting it, which is kind of like a bullseye: you’re operating on different groupings of data in which you have different confidence levels in terms of degree of quality, uniqueness, and so on, and you’re making certain types of judgments based on that. The types of judgments you’ll make on data in which you have, say, a 30% confidence level are going to be very different from the decisions you make on data in which you have a 90% confidence level. Part of governance in the data lake is managing those data quality processes, and also managing the retention and storage of those different categories of data; in turn, there may be different rules of engagement, for instance different rules for access, depending on how clean that data is. As I said, you hint at this in your next slide, so I just wanted to throw that little monkey wrench in there.
[Kelly] Tony, this next slide also comes from your upcoming report, and I was wondering if you could describe it a little bit for us and maybe double-click on two concepts in particular here: self-service and collaboration.
[Tony Baer] If you recall our reference architecture, this was the data inventory area, and I told you we were going to double-click down on it. What we found were two broad groups of functions and two broad groups of roles: there’s a role for end users, but there’s also a very necessary role for IT. In other words, when we start talking about self-service in this area, it doesn’t mean that IT abdicates its function; very much the opposite, IT plays more of an overseeing role. More specifically, this breaks down into two streams: one we call data curation, which is more the end user, the business analyst, the responsibility of business and analytics teams; on the technology side is the physical inventory. First, looking at data curation, it’s where you as the end user, along with your colleagues, are building your own libraries of information, because you’re simply not going to be able to rely on IT to do this; IT just won’t have the resources for it, given that you’re going to be looking at lots of different sets of information and trying to do different things with them. This includes a number of related tasks: profiling the data, and doing data preparation, which in the old world would have been called data cleansing. It’s a different type of creature in a data lake, because you’re taking approaches that are probably more probabilistic, approaches that may involve some machine learning; in fact, we believe machine learning is going to be essential here, because the system needs to give you an assist in making sense of how this data fits together. As part of all this, what you cannot do without is collaboration, because chances are no single group or single person is going to have unique knowledge about all the data, given the vast variety of data that’s going to reside in a data lake. So you’re going to need a way of sharing this insight among what in the old world would have been called power users: “we know something about the source of this data,” “I’ve used this, and it’s useful for these types of problems,” “these are the issues you may run into when you’re trying to prepare this data,” or “here’s a library of routines that we used successfully in preparing this; here’s a way to approach it.” That collaborative data enrichment is going to be essential in enriching the metadata, and out of the metadata comes a cataloguing function. Again, Rome was not built in a day; don’t expect to do all of this at once when you curate data, but ultimately this is what it’s going to involve down the line, probably including matching data and trying to identify where the duplications are.
From that, you may then start to derive what is going to be a canonical record of, let’s say, a customer in this new environment where we’re ingesting all these different data sources that go beyond name, address, or account information. And in turn, at every step along the way, the lineage has to be recorded in terms of what was done to the data and who did it. So, as I said before, collaboration and machine learning are going to be huge here; we’ll go into machine learning a little more in a moment. On the other side, the technology side, for the IT folks, is the physical inventory, which is knowing and managing what data is in the data lake. That’s not an end-user function; it ultimately belongs to the people responsible for maintaining the Hadoop cluster. This is where they have to manage data access and tag data for security. Now, I have to say this is an area where business users or power users can contribute as part of the collaborative data enrichment and share their insight: “we believe this data is sensitive,” or “we believe certain categories of users should or should not get access to this data.” It’s for business users to make that clear during collaborative data enrichment, during the curation phase, and it’s for IT to make sure it is enforced and that there are rules in the physical inventory, because IT is ultimately going to be accountable for who can access this data, not the end users, not the business teams. IT may also need to deal with data retention, and this is an area that hasn’t gotten much thought, with most organizations still early in their adoption of Hadoop and big data; we’re not really thinking about lifecycle yet. Players like Facebook or Google may not archive data, or at least are not archiving at this point, though they might be doing tiering, but your organization’s resources are not going to be infinite. So you’re going to have some decisions to make in terms of how long data should stay resident in the data lake; storage may be cheap, but a lot of cheap storage gets expensive. That’s part of managing the lifecycle and workflow. This is also the area where you need to track data lineage, and that’s going to be a very interesting challenge, because you’re going to have lots of tools contributing their own lineage data: you’ll get it from the data lake management folks, the Zalonis of the world, you’ll probably also get some from the analytic tools that use the data, and so on, and you may get some from the Hadoop platform itself. So this lineage is going to have to be rationalized, so you don’t have 15 different sources of truth in terms of what’s happened to this data.
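As a minimal sketch of the lineage recording Tony describes, capturing what was done to the data, by whom, and when, the structure below is illustrative; the field names and record_lineage helper are assumptions, and a real lake would consolidate lineage from many tools rather than a single list.

```python
from datetime import datetime, timezone

# Illustrative lineage log: one append-only list of events per data set.
LINEAGE = {}

def record_lineage(dataset: str, action: str, actor: str, details: str = "") -> None:
    """Append a lineage event: what was done to the data, by whom, and when."""
    LINEAGE.setdefault(dataset, []).append({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "action": action,    # e.g. ingested, masked, deduplicated, promoted
        "actor": actor,      # a user, a tool, or a scheduled job
        "details": details,
    })

record_lineage("customer_records", "ingested", "ingest-job", "from CRM export")
record_lineage("customer_records", "deduplicated", "analyst.jane", "matched on email + phone")
record_lineage("customer_records", "promoted", "governance-policy", "raw -> trusted zone")

for event in LINEAGE["customer_records"]:
    print(event["timestamp"], event["actor"], event["action"])
```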
Scott, what’s your take on this?
[Scott Gidley] I love this slide. The thing I would add, and we can chat about it here, is the idea of automation; you mentioned machine learning. The automation between curation and the physical inventory is really critical, because using machine learning or other analytic techniques to make recommendations on how people should manage data as it comes in, and how the curation process feeds back into a total data lifecycle management loop, are questions and requests we’re seeing from customers now. The example I’ll use is data retention and lifecycle management. We’re starting to see this from some of our more advanced customers: let’s say a retailer is trying to manage inventory, and the data only has relevancy for a certain period of time, but they’re capturing petabytes of information over a week. As you mentioned, commodity hardware and cheaper storage help, but it’s not free, and ultimately it degrades the integrity of your data lake to keep assets around that aren’t really useful. So one of the things we’re working with customers to do is build a data lifecycle management policy around where data lives in the data lake, how relevant it is, what kind of storage tier it needs to be on, and at what point that data gets moved into an object store like S3 on Amazon or Blob Storage on Azure, where they may want to keep it forever, because at the end of the year, or three years from now, they may have to pull it back, but they don’t want to keep it in their active data lake. And managing the lineage of that is really important, as you mentioned. So I think automation is going to be the next phase. Right now we’re still in the blocking and tackling of how we build these data lakes and how we manage and govern them properly; automation is going to be the key for the next step into the future.
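A minimal sketch of the kind of data lifecycle policy Scott describes, deciding when a data set ages out of the active lake into cheaper object storage, might look like the following; the zone names, age thresholds, and archive paths are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical lifecycle policy: age thresholds and archive targets per zone.
LIFECYCLE_POLICY = {
    "raw":     {"archive_after": timedelta(days=90),  "archive_tier": "s3://lake-archive/raw/"},
    "trusted": {"archive_after": timedelta(days=365), "archive_tier": "s3://lake-archive/trusted/"},
}

def lifecycle_action(zone: str, last_used: datetime, now: datetime = None) -> str:
    """Decide whether a data set stays active or should move to the archive tier."""
    now = now or datetime.now(timezone.utc)
    policy = LIFECYCLE_POLICY.get(zone)
    if policy and now - last_used > policy["archive_after"]:
        return f"archive to {policy['archive_tier']}"
    return "keep in active data lake"

stale = datetime.now(timezone.utc) - timedelta(days=200)
print(lifecycle_action("raw", last_used=stale))      # archives: older than the 90-day threshold
print(lifecycle_action("trusted", last_used=stale))  # keeps: under the 365-day threshold
```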
[Tony Baer] Yeah, I think we’ll all be interested in seeing how folks like you, the newer players tackling these problems, deal with that lifecycle, because it was never dealt with very well in the conventional OLTP and data warehousing environments. Now we need to do it in the data lake, so I’ll be very interested to see the approach you folks come up with.
[Scott Gidley] Great. So the last slide we’re going to discuss is just a typical reference architecture. When we look at Zaloni customers and discuss how they’re going to build a data lake, how they’re going to organize it, and ultimately how they’re going to manage the information within it, we often use this as a talking point. Source systems come from all different places, whether they be files, databases, ETL extracts, or streaming, which is a big thing we’re seeing now with the Internet of Things and data streaming in from social media and other cloud-based sources; that’s as important as any other source within the organization. And we break things up into the concept of zones. So we’ll have a loading zone, which may be inside your cluster and your data lake or may be outside of it, where the data gets landed and maybe some initial ingest and preparation activities are performed on it, and it ultimately comes into a raw data zone, which may have a very specific set of users who have access to the information there and which holds all the data in its original, unaltered format. Then we can tokenize or privatize information as needed before it gets moved into, let’s say, the refined or trusted data zone. There are various levels and policies and definitions of what it means for data to be refined or trusted, or to be promoted into a discovery sandbox or an analytics sandbox. These are the names we use in this reference architecture; our customers may have things broken out by the type of data, and so on. But I think the key is defining the policies for what it means for data to be in a particular zone, and who has access to the information within it. Then, on the consumption side, it’s about who can connect to that information using something like self-service data preparation tools, whether that’s a Zaloni tool or some other data preparation application. This is also where we can manage policies across zones on the relevancy of the information and the data lifecycle management. Again, one of the things we’re focusing on is how we determine when data moves from one zone to another based on automated processing, rather than it needing to be promoted by somebody manually. Automation of as many of these processes as we can manage is going to be really critical moving forward, because the rate of change of data sources on the left is going to continue to grow and grow. So we just wanted to bring this up as an opportunity to say these are ways you can think about your reference architecture, and that can help drive the governance policies for the rest of the data lake.
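To illustrate the zone-based promotion Scott outlines, here is a sketch of an automated check for moving data between zones; the zone names echo the reference architecture above, while the specific policy flags are assumptions for illustration rather than a prescribed rule set.

```python
# Illustrative promotion rules: what must be true before data moves between zones.
ZONE_RULES = {
    ("raw", "refined"): ["schema_validated", "pii_tokenized"],
    ("refined", "trusted"): ["quality_checks_passed", "steward_approved"],
}

def can_promote(dataset_flags: set, source_zone: str, target_zone: str) -> bool:
    """Return True if the data set satisfies every policy required for the move."""
    required = ZONE_RULES.get((source_zone, target_zone))
    if required is None:
        return False  # no defined promotion path between these zones
    return all(flag in dataset_flags for flag in required)

flags = {"schema_validated", "pii_tokenized"}
print(can_promote(flags, "raw", "refined"))      # True: both raw-to-refined policies are met
print(can_promote(flags, "refined", "trusted"))  # False: quality checks and approval are missing
```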
[Tony Baer] I think you’ve laid out some very interesting concepts here, and as I mentioned before, just focusing on the different zones of data, refined data, trusted data, the discovery sandbox, the raw data: these are issues that we really did not deal with in data warehouses. We just assumed that all the data there was refined and was for specific uses. In a data lake that’s really not the case, so this is a new type of thing where we’re going to have to figure out how to manage diversity in the forms of data and how it’s used. It’s very important whitespace for governance to fill in there.
[Kelly] So, Tony and Scott, thank you very much for this. I want to give the audience just a few moments to see if they want to jump in with any Q&A. If you have any questions for either Scott or Tony, type them into the Q&A chat box and we’d be happy to address them. While I give them 30 seconds to see if there are any questions, I do want to ask you both one general question. As we listened through all this, the phrase “the more things change, the more they stay the same” comes to mind for me. Tony, do you have a few thoughts on how data management has changed over your career? You’ve been covering this space for a little while.
[Tony Baer] I do see that metaphor very much applying, and what I think back to is when I was covering the emergence of data warehousing and BI, almost exactly 20 years ago. At the time, data warehouses and SQL databases were considered the big data of their day, because this was the first time that data was in a data store independent of the application, and as a result we could get a bigger view. At that time we had client-server, and then we started to go toward the internet and web-based architectures, but with rich clients, so we could do more than green-screen reports. I remember going to a lot of those early TDWI sessions, and there was this one guy in the back of the room who was awfully obnoxious, saying, “Do you realize what the quality of the data is that you’re feeding in there?” The fact is, that was not really an issue we had thought about before. When we had individual systems, each with their own idiosyncrasies, we had workarounds; but when you put them all together, you had data with varying levels and varying patterns of sparsity. That guy was obnoxious, he was saying the emperor has no clothes, but he was right. And we’re in a similar position today. When you look at how big data platforms, or Hadoop, were originally used, we put in log files and never thought to clean them up, and we never thought to restrict access to them, because how sensitive is that data, and who’s really going to use it outside of a few trusted data scientists? Well, guess what: today we have a much greater variety of data, and some of those log files are going to be sensitive. I’m actually participating in an IoT cybersecurity panel tomorrow night here in New York, where I’m based, and we’re looking at, for instance, connected cars. There have already been instances; there was a well-publicized, controlled experiment where, I think, a Jeep, and this was not even a self-driving car, had its software hacked. So we have to take another look at data that we didn’t think was necessarily sensitive; it’s going to be a lot more sensitive. I think the issues of sensitivity and quality are déjà vu all over again, except it’s a much more complicated problem this time around.
[Scott Gidley] I would just echo that. I think it is the same problem all over again, but we have different tooling; the technologies have advanced to the point where we can handle it in a different way. As someone who helped build out some of the early-stage data quality and data integration tooling, I’d say that having in-memory capabilities, the ability to ingest data at different rates, and the different in-memory processing architectures means we can hopefully learn from how it was done before and do it in ways that improve not only the speed of what we can do but also the results.
[Kelly] All right, thanks, guys. I’ve got one question here that I’m going to hand over to Scott, because it’s very specific to our Zaloni data lake reference architecture. The question, Scott, is: what is the transient loading zone that we have in our architecture?
[Scott Gidley] Sure. This represents a place, possibly outside of the data lake, where data that’s being ingested initially lands in a very transient way so that some different types of processes can be run against it: maybe you need to remove some sensitive data, or maybe data needs to be transformed in some way, before it comes into the actual data lake and what we would call the raw data zone, which is usually data in its native format. It’s not always used, but in some cases we’ve found people want processing to occur outside the data lake before data can get in, and in the architecture we call that the transient loading zone. It’s just the name we use in the architecture, but it’s a way to handle sensitive data outside of the data lake.
[Kelly] Perfect. Thank you very much. Well, I don’t have any other questions, so I’d like to go ahead and close this down. Thanks so much, Tony; we’re looking forward to seeing your report published and to your further work on this topic throughout the year. It’s very interesting to us. And to the rest of you, thanks so much for participating today. We’re going to follow up with you via email with a link to the recording of this webcast, and we’re also going to send you a link to the charts. Please feel free to reach out to us if you have any other questions, if you want to continue this conversation, or if there’s anything we can do to help you modernize your data lake architecture. Have a great rest of the day. Bye bye.