January 16th, 2019
How does your organization collaborate with data? Aligning data management tasks across any size organization can be a challenge. This can be attributed to a lack of transparent data access, lack of big data skills, or antiquated toolsets that do not enable shared metadata for clear lineage of the data. Regardless of the reason, the results are slow, rigid decision-making processes.
While modernizing your data architecture for more agility can seem overwhelming, with an integrated platform that enhances collaboration, organizations can reap the benefits of quality data that is well understood. The data platform should give users the ability to fully understand all aspects of the data through a simple, unified user interface where the business and IT can define, transform and provision the data, all while providing right-sized governance for access, security and auditability.
Join Clark Bradley, Solutions Engineer with Zaloni, as he tackles modernizing your data platform and explains how your organization can expand collaborative practices with the Zaloni Data Platform.
By the end of the presentation, you’ll be able to answer these questions:
– Why is a data catalog important?
– What do I need to know about data quality?
– How does self-service play a role in the data strategy?
Ready for a modern data platform? Get started with a demo of Zaloni Arena!
Read the webinar transcript here:
[Brett Carpenter] Hello everyone, and thank you for joining today’s webinar, The Top Three Considerations for Modernizing Your Data Platform. My name is Brett Carpenter, and I’m a marketing strategist at Zaloni. I’ll be your emcee for this webcast. Our speaker today will be Clark Bradley, Zaloni’s solutions engineer. We will have time to answer your questions at the end of the presentation, so don’t hesitate to ask them at any point using the Ask a Question box located just below the player window. You’ll also notice we’ll have a few polls going throughout the webcast. That’s where you use the Vote tab, located just under the player window as well, to participate.
Now, before we dive in, I’d like to provide a brief introduction to who we are. Zaloni simplifies big data. We help customers modernize their data architecture and operationalize their data lakes to incorporate data into everyday business practices. We supply the Zaloni Data Platform, which provides comprehensive data management, governance and self-service capabilities. We also provide professional services and solutions that help you get your big data projects up and running fast. With that, I’ll turn it over to Clark, as he discusses what’s needed to achieve a modern data platform. Clark?
[Clark Bradley] Thank you, Brett. Hey everybody, thanks for joining us here today. What we want to talk about are the top three considerations for modernizing your data platform: being able to discover and understand your data, validating and governing data quality, and driving efficiency by including self-service as part of your data strategy. So let’s kick it off by talking about the data lake and defining what a data lake is. The term data lake came out in about late 2010 as a result of the restrictive nature of data marts. Whereas data marts hold well-packaged and structured data, user needs were evolving around that time. Hadoop had come out just a few years earlier, so big data was all the rage, and big data is generally defined around three criteria, sometimes four, known as the three or four V’s. The first is volume, which is the size of the data being ingested into the system, a lot of times driven by social and IoT, and also the size of data being accessed for processing. There’s velocity, which is both the frequency of data arriving into the system and the speed of processing needed. Variety is the third V, and it describes all the different formats of data that we’re working with: structured, unstructured, and even semi-structured data. The fourth V that comes up quite a bit is veracity, and that describes the reliability or the trust that an organization has in the data. Is this a source of data that we can trust?
So, to simplify it down: many different formats of data, and numerous data processing frameworks, so lots of things to choose from, from SQL and NoSQL to batch and streaming processing. And more and more, we’re seeing the inclusion of cloud environments in the modern data platform. It can be on-premise, it can be off-premise in a public or private cloud, or even some hybrid construct that falls in between. Now, what would a data lake discussion be without talking about the data swamp? The converse term, data swamp, describes a common problem where data lake projects, generally missing sufficient data governance or data management processing in the environment, fall into a state where the data is hard to use, hard to process, and almost impossible for users to access. There are a couple of different things that can promote a data swamp. Uncontrolled consumption is the ingestion of a variety of different data without sufficient data quality or controls in place to monitor the data. Lack of visibility, like we were saying, lack of access, is a key issue, as the relationship of the physical data to its business purpose is not something that’s inherent or automatic when gathering data; it’s something that must be curated and added to the data so that everybody understands it. Variety and velocity, two V’s we mentioned earlier, can introduce challenges in maintaining the precision of different data elements. Variety in data formats means that you have numerous options to pick from for storing and computing on the data, and this can become a challenge by leading to skills gaps within the organization. And finally, the migration of legacy systems to next-generation applications can be costly if proper planning for fit and support isn’t in place. It’s pretty common to see lift-and-shift methods trying to move wholesale applications that were previously connecting to on-premise DBMS systems directly into Hadoop or cloud systems. If that processing wasn’t originally architected to take advantage of distributed systems or new methods of processing, the results can be not as great as we would hope, and there’s some need there for re-architecting. All these issues can hinder an organization’s ability to create trust in the data and find value in that data. So let’s get to our first consideration, being able to discover and understand the data. And with that, we have our first poll.
[Brett Carpenter] Perfect. And, again, you can use the Vote tab that’s located just below the player. The question is: What challenges do your users have finding data? Identifying the data that applies to a specific business challenge; quickly being able to access the data to prepare it for the task; understanding the data’s distribution, origin and gaps; sharing data and process between different users, business and IT; or all of the above.
So we’re going to leave this open for a little bit longer, you know, 30-45 seconds, to give you all a chance to answer, and then we’ll go over the responses.
We’ve got our first all of the above. I think you could go on and on with these types of questions. These are some fairly common ones that I’ve run across, so it’s interesting to see the answers coming in. That’s an even split. We’ve got a statistical tie. There we go. So let’s give it another 15 seconds and then we’ll stop the voting.
[Clark Bradley] Okay, let’s take a look at these results. The major ones coming out are that identifying the data that applies to a specific business challenge is tied with sharing data and process between different users, but the highest one so far is all of the above. All right, I think we’ve probably reached our stopping point.
So, let’s talk about this consideration of being able to discover and understand the data through a modern data platform, and for that let’s define what a catalog is. A catalog provides organizations with a central point to define the full context of their data, and what that means is that it’s a combination of technical and business terminology. Sometimes you hear it referred to on the technical side as a data dictionary, and on the business side as a business glossary, and a catalog provides the best aspects of both of those. Different roles in the organization have different needs when it comes to understanding the data, and a catalog provides all the details to empower the users to capitalize on the data. An important aspect is that it needs to be able to scale with the data. As the data is manipulated and reorganized, even the very definition of the data, the business definitions, can change over time, and the data catalog needs to be able to scale with those changes. This enables organizations to manage those changes in the data over time. Finally, the data catalog promotes data as an asset within the organization. The data catalog is going to establish the business usage of the data and how the data relates across business operations. By treating the data as an asset, it enables collaboration and supports consistency of business results when all the users are leveraging the same data. So what aspects of a data catalog are important? Well, in my little diagram here down at the bottom, I’ve got a virtual representation of a data lake, and you can see that I’ve got different artifacts, different bits of data, in my data lake, but it’s really hard to tell what they are, mainly because they’ve all got question marks on them, right? I can’t really tell what those are, so I need to be able to understand what that data is. Even though the uses of data are going to differ across different users and groups, all users must be able to see data through the same lens. Metadata falls into essentially three categories, two of which we talked about earlier: the technical metadata, which is like a data dictionary that defines the form and the structure, or, in the case of unstructured data, the lack of structure in the data; and the business metadata, which imparts the business context of the data and helps reduce the barriers in human-to-human interactions. The third piece of metadata is the operational metadata, and that explains the transformations, how the data is being refined over time, which we’ll get to a little later in the webinar. So a data catalog needs to empower the users with a consistent, unified view, and that unified view means that all the users are speaking the same language. This empowers the organization with clarity into the data. Collaborating on the data through this unified view also means that we need ease of managing additions and changes to the metadata, so that the data is always accurate and timely. For instance, if you add new terms or tags or labels from a business standpoint, a user needs to be able to add those attributes to the catalog so that all the users are easily able to find the data. And from a technical standpoint, as the structure changes, which is sometimes known as schema drift, and the understanding of the data changes, those attributes also need to be quickly integrated into the catalog.
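To make those three categories of metadata concrete, here is a minimal, illustrative sketch of what a single catalog entry might carry. The field names and the sample dataset are invented for this example; they are not Zaloni's actual data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class CatalogEntry:
    """Illustrative catalog record combining the three metadata categories."""
    # Technical metadata: form and structure (the data dictionary side)
    dataset_name: str
    location: str                      # e.g. a path or table identifier
    schema: Dict[str, str]             # column name -> data type
    # Business metadata: context that business users search and collaborate on
    description: str = ""
    tags: List[str] = field(default_factory=list)        # e.g. ["credit_card", "loans"]
    glossary_terms: List[str] = field(default_factory=list)
    # Operational metadata: how the data is refined over time
    source_system: str = ""
    last_updated: str = ""
    upstream_datasets: List[str] = field(default_factory=list)  # feeds lineage

# A business user might later find this entry by tag ("credit_card"),
# while a data engineer finds it through its upstream workflow.
entry = CatalogEntry(
    dataset_name="card_transactions",
    location="s3://lake/trusted/card_transactions/",
    schema={"txn_id": "string", "amount": "decimal(12,2)", "state": "string"},
    description="Cleansed credit card transactions",
    tags=["credit_card", "finance"],
    source_system="core_banking",
    upstream_datasets=["raw.card_transactions"],
)
```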
A global search delivers a centralized point for organizations to discover that data. Rather than having users go through complex filters or nested searches, a global search gives users a comprehensive way to discover artifacts across the three different types of metadata we have here: business, operational and technical. A business user might choose to find data by a specific tag, so if we’re talking about financial data, maybe we tag things as credit card transactions or mortgages or loans, or use other labels that are part of that metadata. But a data engineer might be looking for data completely differently: the data engineer might be trying to identify data that’s part of a workflow, or associated with a common set of data quality rules. No matter what pattern the users follow to find the data, they should reach the same conclusions and land on the same data, rather than various forms or versions of that data. Another aspect of a catalog that’s important is being able to capture all the metadata. As the data changes over time and moves from source to consumption, users need to be able to understand those transformations and transitions in the data. If any additional data sources are blended in, or if the structure of the data changes, this is all captured through what’s known as data lineage. Data lineage shows the progression of the data as it’s transformed, cleansed and provisioned for a specific task (there’s a small sketch of this idea below). All right, so we’ve talked about the first consideration of having a modern data platform, being able to understand and discover your data, and that’s really representative of having a catalog. That catalog is going to be the central nexus of how your organization comes together to find and locate data and enrich it with the information that helps everybody talk about the data in the same context. The next consideration of having a modern data platform that we’re going to talk about is validating the quality of the data.
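As a quick illustration of the data lineage idea just mentioned, here is a minimal sketch in Python. The dataset names, transformations and the edge-list representation are assumptions made up for the example; a real platform would capture this automatically as operational metadata.

```python
from collections import defaultdict

# Illustrative only: lineage captured as "derived-from" edges between datasets,
# along with the transformation that produced each hop.
lineage_edges = [
    ("raw.card_transactions", "trusted.card_transactions",
     "standardize state codes, drop duplicates"),
    ("trusted.card_transactions", "refined.monthly_spend",
     "aggregate by customer and month"),
]

upstream = defaultdict(list)
for source, target, transform in lineage_edges:
    upstream[target].append((source, transform))

def trace_lineage(dataset, depth=0):
    """Walk back from a consumable dataset to its raw sources."""
    for source, transform in upstream.get(dataset, []):
        print("  " * depth + f"{dataset} <- {source}  ({transform})")
        trace_lineage(source, depth + 1)

trace_lineage("refined.monthly_spend")
```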
[Brett Carpenter] And we have our second poll. All right, so the question here is: What is your top cause of inconsistent data quality? The choices are inadequate technology, human error, misaligned standards across data sources, duplicate data, or incomplete records or gaps in the data. We’re going to keep the poll up for about a minute to give everybody a chance to respond. Again, use the Vote tab that’s located just below the player window. All right, let’s wait for some of these results to come in.
[Clark Bradley] Interesting distribution so far. I used to have a boss who would tell me this isn’t a challenge so much as an opportunity; somewhere, somebody can make a name for themselves by cleaning these things up. Misaligned standards is pretty uniformly the major one right now. Let’s leave it open for another 10 seconds and get any stragglers. All right, I’ll go ahead and close the poll. The number one response, from 45% of you, was misaligned standards across data sources, and number two and number three were a tie: number two being human error and duplicate data, and number three being inadequate technology and incomplete records or gaps in the data. Very interesting. So with that, let’s talk about validating the quality of the data. What are some of the challenges to maintaining data quality? Back to my diagram down here at the bottom: we’ve identified and we understand the data, and now maybe some of the data isn’t what we want. If you’re not a fish eater, you might prefer the boot or the can, right? Maybe you’re a recycler, and so the can is exactly what you’re looking for. In my world, I’m saying the fish are the important artifacts of data that I want to find and have access to. So data quality really helps organizations with the discovery of redundant or missing information. The challenge is that the data falls into a problem of just not being understood from a context standpoint, generally more on the business side than on the technical side. I’ve run into a lot of organizations that have these gaps in the data, and it’s just problematic to find them. Data issues can be introduced at any point in the pipeline, from a source system or maybe even during the data preparation phase itself, but the important thing is that you want to find them before they find you; that’s the absolute necessity here, and we’ll get into that in a minute. Missing information can cause problems: if you have null entries in joining columns, or redundancy, that can lead to a number of different issues as far as the replication of data that takes place, and in general it just creates a lot of distrust in the environment for the data that you have. Data quality techniques themselves, especially around standardization, can be pretty complex. Imagine an example where you’re standardizing contact information and you’ve got a state or province field that could have issues with spelling or casing or mixed abbreviations. These data quality procedures are based on complex rules, and they have to take into account every permutation of the data when they’re trying to standardize it and shape it into a consistent value. So even though the data is inconsistent, that work has to be done consistently across the data and the structure, and that can be a challenge (there’s a small sketch of this kind of rule below). Another challenge here is around sensitive data. That’s always been important, but now in the days of GDPR, the General Data Protection Regulation, as well as new laws that are coming out to protect personal data, it’s more important than ever to have the mechanisms in place to secure the data.
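Picking up that state/province example, here is a minimal, illustrative standardization rule in Python. The synonym table and field values are invented; a production rule set would have to cover every permutation actually observed in the source data.

```python
import re

# Illustrative standardization rule for a "state" field.
STATE_SYNONYMS = {
    "nc": "NC", "n.c.": "NC", "north carolina": "NC",
    "ca": "CA", "calif": "CA", "california": "CA",
    "on": "ON", "ont": "ON", "ontario": "ON",
}

def standardize_state(value):
    """Map spelling, casing, and abbreviation variants to one canonical code."""
    if value is None:
        return None
    key = re.sub(r"[^a-z.]", " ", value.strip().lower())
    key = " ".join(key.split())              # collapse extra whitespace
    return STATE_SYNONYMS.get(key, None)     # None flags the value for remediation

assert standardize_state(" North  Carolina ") == "NC"
assert standardize_state("Calif") == "CA"
assert standardize_state("Narnia") is None   # unknown values get remediated, not guessed
```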
So the proper techniques, need to be in place to ensure that the data like for example as a social security number, remain at a consistent cardinality so if you’re doing any type of analytics on top of the data and you’re reliant on that value to be a unique identifier, using simple masking and things like that, just, you know, aren’t going to give you the results that you’re looking for, and at the same time that data’s got to remain completely isolated. Data remediation prophecies, provide a clear path to monitor and resolve data quality issues. And so this part of the data pipeline is often overlooked when these systems are being architected. Sometimes, due to data ownership not being very clear so remediation is all about checking with the data owners to see if the data can either be removed or you know how it should be cleaned up depending on the source system that it comes from. So a lot of times you’ll see processes that can become a patchwork of different solutions that are more reactive in nature, and can be quickly overwhelmed by new data sources or larger data volumes. How do we ensure robust data quality in the data lake. Well, you know it’s it kind of goes back to a little bit what we were talking about around the catalog, you know, but this time around quality is that we need to be able to understand it and we need to be able to fix it. So sorry if you were hoping that the, the camera important in my world or not so we need a process where we can we can easily find those things, and remediate them. Having robust data quality doesn’t really begin and end at software purchase which is hard for a software vendor to say out loud but really the software it’s it’s an important part of the stack, because it’s providing the necessary tools to monitor transform and support the governance of the data. But the other key components around having robust data quality is really within the organization itself. So, data quality as part of an effective data governance strategy is best served by having people policy and process. So the people being stakeholders and data stewards, identify the quality of the data and define the standards, and then those can be developed and executed across the data lake. So the software stack should support that process by enabling that cross organization collaboration right and so that that’s definitely helped out by the catalog, because everybody’s focused and looking at the right data that they want to attach that data quality to and how to refine it, and the cross-collaboration just really helps speed up the monitoring and validation of the data throughout the enterprise. With ever-growing consumption of data, automation comes up quite a bit and it’s becoming a necessity, pretty much in all in all facets of data processes seeing whether we’re talking about analytics or data management automation and data quality means understanding the state of that data as early as possible so before downstream systems are consuming bad data we need to know, you know exactly you know by audit and monitoring. What the quality of that data is so that people aren’t getting access to data that are going to skew results. And by learning from that data quality, those issues and incorporating that learning into the process, bad data can get held back until it’s ready for consumption so that means either sending it back to a source system that has a remediation process in place to do it. 
Organizations can get started early on with initial steps like incorporating profiling and analysis of the data for data quality during ingestion, which can speed up those quality tasks because we know, just as we’re consuming the data, before it’s even landed, what its state of quality is. When we talk about the data lake architecture, a zone-based architecture can play a key role here in orchestrating the data from raw to refined. By utilizing a zone-based architecture, organizations can maintain the original raw data, consistent with source systems, while at the same time progressively transforming the data to support many other use cases (there’s a rough sketch of such a layout below). Some of the things you see here include sandbox zones, which give business users an area to discover and develop data quickly. You see a lot of work take place in sandboxes by data scientists doing more analytic data manipulation, like interpolation and imputation of the data, things that wouldn’t necessarily fall to general users. Then there are enterprise-level zones, like a trusted and a refined zone, for general consumption. These zones, moving across trusted and refined, set the transitions and phases of data quality for a specific purpose. Incorporating a data remediation process that continuously improves data quality is important; it turns those reactive solutions into a proactive solution that keeps the data moving quickly through a modern data platform. The remediation process is as much about resolving bad data as it is about planning for obsolescence, which can mean deletion, so we might be permanently deleting records due to a retention policy, or it might mean moving data to lower tiers like cold storage, which are becoming more and more common in cloud-based architectures. At best, the data remediation process is collaborative, to be able to discover, organize, cleanse and provision data. All right, so far we’ve talked about two considerations for a modern data platform. We talked about leveraging a catalog to understand and discover data, and then we talked about the necessary pieces and parts to ensure data quality, which goes back to understanding and being able to monitor that data quality and then remediate the data as necessary. The last consideration for a modern data platform we’re going to talk about is self-service to the data.
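As a rough illustration of that zone-based layout and the idea of holding bad data back, here is a small Python sketch. The zone names, paths and checks are assumptions for the example, not a prescribed layout.

```python
# Illustrative zone layout for a data lake.
ZONES = {
    "raw":     "s3://lake/raw/",       # data exactly as it arrived from the source
    "trusted": "s3://lake/trusted/",   # standardized, quality-checked data
    "refined": "s3://lake/refined/",   # shaped for specific consumption use cases
    "sandbox": "s3://lake/sandbox/",   # scratch space for analysts and data scientists
}

def promote(dataset: str, from_zone: str, to_zone: str, checks: dict) -> str:
    """Move a dataset forward only if its data quality checks pass."""
    failures = [name for name, passed in checks.items() if not passed]
    if failures:
        # Bad data is held back for remediation instead of reaching consumers.
        raise ValueError(f"{dataset} blocked from {to_zone}: failed {failures}")
    return f"{ZONES[to_zone]}{dataset}/"

# Example: only promote once null checks and standardization checks pass.
print(promote("card_transactions", "raw", "trusted",
              {"null_check": True, "state_codes_standardized": True}))
```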
[Brett Carpenter] And we have our final poll. All right, the question here is: Is self-service part of your data strategy? The choices are: we are completely self-service; we mostly use self-service tools, augmented by some IT services; we heavily use IT services with some ad hoc capabilities for the business; and there are no self-service capabilities, but we’re planning on it in the future. Again, use the Vote tab located just below the player window. We’ll leave this open for another minute or so to let everybody get their responses in. All right, maybe another 20 seconds or so. It’s interesting that there are no completely self-service people out there.
[Clark Bradley] There’s got to be somebody with Excel, Brett. The best self-service tool ever invented. A few more seconds, and I’ll go ahead and close out the poll. All right, so we had a tie for number one between we heavily use IT services with some ad hoc capabilities for the business, and no self-service capabilities but we’re planning on it in the future. The number two response was we mostly use self-service tools, augmented by some IT services, but nobody said that they are completely self-service. All right. So let’s talk about self-service and what a self-service data strategy is. You may have heard of tools for data wrangling and data blending; these are essentially what these self-service tools are. I was being a bit tongue in cheek talking about Excel, though, to be honest, it’s not surprising to see a lot of desktop users leveraging tools like that for self-service, getting ODBC or JDBC access to the data and pulling things back to their desktop to do work. But as far as data management and self-service, in the last five to seven years there’s been an explosion of tools in this space to access, visualize, transform and cleanse big data. These tools provide a variety of advanced capabilities to business users of all skill levels, and for IT, what these tools provide is the ability to enhance data governance. If we pull up the different roles here, the tools are very interesting because they speak to a wide variety of business users. Business analysts traditionally might have some or no coding skills; I’ve run across some business analysts who can do SQL, but generally they have only a few coding skills, and a drag-and-drop or wizard-driven interface makes it easy for that user to develop complex tasks. At the same time, on the IT side, that data processing is created very efficiently, because the application itself is writing the code that’s being submitted to the system, and it’s generally hooked into very efficient processing systems, like Spark, for instance, or a database (there’s a small sketch of this idea below). A data scientist will tell you that they love to code, or sometimes that they live to code, and they can create really deep, complex processing on a variety of different types of data, but they need a framework in order to operationalize those completed tasks, so they don’t remain just in development. The best self-service tools provide the infrastructure for the analyst and the data scientist to collaborate on those tasks, while at the same time providing IT, which is the data engineer, with the path to manage and optimize the process for repeatability. This is really how they all collaborate: business analysts can visualize and look at the data, they know the business rules and how they want to apply them; they might work with the data scientists on more complex tasks and interweave that code into some type of pipeline; and then ultimately it needs to be handed off to IT to be managed, monitored and put into a production system. So, what value does self-service add to the data strategy? Well, it’s all about that collaboration, right? It’s all about bringing all these users together to work with the data. Self-service is all about accelerating that time to value with the data.
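To picture what “the application writes the code” might look like under the hood, here is a hypothetical sketch: a list of wizard-style steps translated into Spark operations. The step format and sample data are invented for illustration and are not how any particular product generates its code.

```python
from pyspark.sql import SparkSession, DataFrame

def apply_steps(df: DataFrame, steps: list) -> DataFrame:
    """Translate simple, declarative steps (as a wizard might record them)
    into Spark DataFrame operations."""
    for step in steps:
        if step["op"] == "filter":
            df = df.filter(step["condition"])
        elif step["op"] == "rename":
            df = df.withColumnRenamed(step["from"], step["to"])
        elif step["op"] == "dedupe":
            df = df.dropDuplicates(step["keys"])
    return df

if __name__ == "__main__":
    spark = SparkSession.builder.appName("self_service_sketch").getOrCreate()
    df = spark.createDataFrame(
        [("t1", 10.0, "NC"), ("t1", 10.0, "NC"), ("t2", -5.0, "CA")],
        ["txn_id", "amount", "state"],
    )
    steps = [
        {"op": "filter", "condition": "amount > 0"},
        {"op": "rename", "from": "state", "to": "state_code"},
        {"op": "dedupe", "keys": ["txn_id"]},
    ]
    apply_steps(df, steps).show()
```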
I think it was Gartner, or maybe Forrester, that quoted that 80% of a data scientist’s time is spent on data preparation, and the other 20% is spent complaining about the data preparation. No, seriously, they say the other 20% is the time actually spent working on the analytics. And so the goal here is to flip that: we want to spend more time on the analytics and driving towards the insights than we do on preparing the data. So this is one of those ways to accelerate it.
The self-service tools of a modern data platform provide the necessary security layer to pass user authentication and authorization down into the data lake. This gives IT the capabilities to manage, monitor and audit the users’ access and workflows. In combination with the zone-based architecture we discussed earlier, users have their space, and they have the tooling necessary to write efficient code that’s able to leverage these additional resources across the organization. Self-service tooling injects flexibility by allowing users with different skills to securely interact with the data, and it provides the agility for all users to get access and ramp up quickly. Rather than having skills gaps and trying to find systems that fit what your people can do, these tools are ready to go, can generally be put in place very quickly, and can be ramped up with very little enablement, generally supporting, on the data science side, the same coding paradigms like Python or R and the APIs they’re used to leveraging anyway, or, on the business analyst side, a very intuitive interface where they can quickly start working with the data, understanding the data, and making changes to the data. The overall goal is the creation of a repeatable process where tasks are promoted into production environments and monitored. Going back to the operational metadata we were talking about, that’s really what this is: encapsulating those tasks in the metadata so that organizations can quickly productionize the work and provide visibility into the data as well as the transformations. With the wide availability of self-service tools today, it’s really important to ensure that the self-service tools provide an efficient transformation to a modern data platform: they give you the access and the type of processing that you need on the data. They provide user security, so role-based authorization, so that you can give different levels of access to, let’s say, a data scientist than you might to a data analyst or business analyst, with IT having more responsibility there for being able to move tasks and things like that. They should also contribute to that overall enterprise metadata, which is another area that’s important as these solutions get built out: you really want strong data governance that’s automatic, because the applications themselves are contributing business, technical and operational metadata to the enterprise, so that we get clear visibility into the entire operation. So, in summary, it’s all about getting to the fish: finding the artifacts in the environment, making them assets, understanding them, enriching them with metadata, applying the quality rules so that we’re getting good insights out of the data, and then having the tooling necessary for users to get to the data. The data catalog enables all users to find and understand the data. To achieve quality, we need people and policy enabling the process, and self-service is critical to being able to scale and accelerate that time to value. I want to thank everybody for their time. I really appreciate you attending. Are there any questions?
[Brett Carpenter] Again, if you have any questions, you can use the Ask a Question box that’s located just below the player window. As we wait for some more to come in, we’ve actually had a few come in throughout the course of the presentation, so I’ll jump right into it. Let’s see, the first one is: what is the difference between a data dictionary and a data catalog?
[Clark Bradley] That’s a good question. The data dictionary really defines the technical aspects of the data, so it’s all around what’s sometimes called DDL, or data definition language. It’s all around the structure, and, if we’re talking about more advanced systems, the storage: maybe it’s on Amazon in S3, is it in a binary format or an ASCII text format, and even the statistics around the data, like cardinality, the number of null values, and gaps and things like that. All of those are about the data’s physical form. A good data catalog is going to incorporate metadata from the data dictionary as part of the catalog, and it’s also going to include that business terminology as well, so it’s really the best of both, and that’s what allows IT and the business to talk about those datasets through that singular view of the data.
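For a concrete feel of the statistics side of a data dictionary, here is a small illustrative profile in Python using pandas; the sample data is made up for the example.

```python
import pandas as pd

# Illustrative profile of the kind of statistics a data dictionary captures
# alongside the schema: data types, cardinality, null counts and gaps.
df = pd.DataFrame({
    "txn_id": ["t1", "t2", "t3", "t3"],
    "state":  ["NC", None, "CA", "CA"],
    "amount": [10.0, 25.5, None, 40.0],
})

profile = pd.DataFrame({
    "dtype":       df.dtypes.astype(str),
    "cardinality": df.nunique(dropna=True),
    "null_count":  df.isna().sum(),
    "null_pct":    (df.isna().mean() * 100).round(1),
})
print(profile)
```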
[Brett Carpenter] All right. Let’s see, the next one actually touches on one of the four V’s of big data. It says: how can organizations address data veracity threats?
[Clark Bradley] So, just going back, veracity is about being able to have that trust in the data, and this is kind of an interesting area, because when big data first came out, we only talked about the three V’s. Then the fourth came along, and sometimes a fifth, and they’re kind of similar: veracity and value. You can think of value as the organization finding value in the data and getting rid of the stuff it doesn’t need. But around veracity, there’s a lot going on today around being able to apply a trust score, driven by data scientists, into the catalog, so either attached to a data catalog or as part of an overall data quality process. That means we have a trust score on the data, and that’s going to help drive whether we remove or silo off poorly trusted data, and that’s a good idea. Data intelligence practices are also starting to form within organizations to address manipulated data, so if you’re making any type of changes to the data, these data intelligence practices are setting the standards for what’s acceptable as far as risk.
[Brett Carpenter] All right, we had another one come in that says: I assume operational metadata is going to be captured by the tool. How can I ensure business metadata is captured or maintained seamlessly?
[Clark Bradley] That’s a good question. The operational one can be tricky. There are several different third parties that will openly share operational metadata from a data management standpoint, so I’m going to focus on that; analytics is a whole other ball of wax, and that operational metadata may or may not be shared. So when these systems are being architected, it’s important to make sure that you either have one that’s comprehensive, or that you’ve got a system of different pieces of software that will share that metadata, so that collectively you can bring it all together in one spot. Business metadata, or the data catalog in general, is more of a manual activity, meaning there are some forms of automation that can be done, depending on where the data is coming from and whether categories of the data and things like that are defined, but in general this is where having the people in place, the stakeholders and the data stewards, with a user interface into the data so they can work with it, really matters. I’ll give you a quick story. I worked with an organization that had one guy with a set of business rules for how a specific code was to be dealt with over time, as that code got transitioned through a hierarchy of people within a policy. He was the only one who had it; he told us it took him years to gather, and he kept it on a piece of paper and shared it with us, but everybody had to go to him to get the business understanding of what that code was and, as it changed forms, what its definition was, because there was no unified system. Everybody had to go to him, so if he lost his job or took the piece of paper home, all of that gets lost. So it’s very important to make sure that the systems can integrate that metadata. Good question.
[Brett Carpenter] All right, we’ve got another one that came in. It says: how do I shift, or how do I transition, from a data warehouse to a data lake?
[Clark Bradley] So there are, I think, a number of things leading up to that, to figure out whether you should even go to a data lake or not. It goes back to how you know whether you’ve got big data, and a lot of organizations, and I kind of hinted at this early on, aren’t even talking about big data anymore; it’s just data, it’s just the way it is. If you’re starting to hit those three V’s, that’s kind of a known turning point where you need a different system to handle it. From a simple standpoint, think of the NoSQL versus SQL types of transactions; it’s broader than that, but that’s kind of a defining feature to know whether you need to shift. Then, as you shift, I think these considerations for a modern data platform are the important parts you need to think through to be able to make that shift. But more importantly, to the second part of the question, how do you transition from a data warehouse to a data lake: it can be very overwhelming for organizations to look at the entire pie at once. I worked with a specific company a few years back that went through this, and they did a lift and shift, and their processing time had grown to something around 27 or 30 hours, whereas previously on the databases it was somewhere around seven or eight hours, so they were struggling because they’d done the lift and shift. From a strategic standpoint, once you figure out that you need to move to a data lake, that’s one step. The second step is to start categorizing the types of work that you’re doing and focus first on the ones that are the most well understood within the organization. Don’t try to bite off the entire thing; don’t set a timescale of, say, six months in which you’re going to move everything into the data lake. Take the workloads you understand well, apply them in a development environment, and you’re going to learn a lot through that process. If you take the ones you understand best, you’re going to understand what you have to do to architect them to work well in that new data lake environment, because chances are processing times are going to change, you’re going to deal with format issues, and you’re going to be picking out the best format for your data to support the downstream systems, whether it’s reporting or analytics or visuals. Then start applying those lessons to the harder challenges that you have operationally.
[Brett Carpenter] All right, perfect. Well, that looks like that’s all the questions that we’ve got.
So I want to take this time to thank all of you for joining us for this presentation, and again to thank Clark for taking the time to speak with us about what you need to consider when creating a modern data platform. This presentation will be available on demand on Zaloni’s resources hub and the BrightTALK platform for future viewing, and the slides will be available in the Attachments tab located just below the player window. Thanks again, and we will see you next time. Thanks everybody.