A Modern Digital Data Architecture: Best Practices for Adoption

March 13th, 2019

Organizations that put analytics and artificial intelligence (AI) at the core of their transformation strategy will survive and thrive in the age of digital disruption. To achieve this, a holistic, modern data architecture and a rock-solid information supply chain are critical for success.

Organizations can deliver timely, self-service, democratized data access and analytical insights at enterprise scale by leveraging the innovation design principles of data lakes, scalable and elastic cloud infrastructures, and automated information pipelines. However, many find that these architectures are complex to create, deploy and operate — often resulting in poor performance, unnecessary expense and underutilized assets for the do-it-yourselfers. Transitioning to such architectures from legacy paradigms carries additional difficulty and risk, especially in hybrid environments that can span multiple design patterns and cloud providers.

In this webinar, Clark Bradley, Zaloni solutions engineer, and Alex Gurevich, DXC Technology’s Analytics chief technology officer for the Americas, will present solution designs and representative field-use cases for simplifying and accelerating adoption of a modern, digital data architecture.

Topics to be discussed will include:
– Best practices for migrating from a legacy to a modern data architecture
– Deploying a data catalog in support of data lake architectures
– Data lake architectures for hybrid and cloud environments
– Protecting data assets and privacy without obstructing access



Join experts from DXC Technology and Zaloni as they present solutions to data sprawl and data architecture best practices.

Are you ready to adopt a modern data architecture? Get your demo of Zaloni Arena DataOps platform today!

Read the webinar transcript here:

[Alex Gurevich] We’ll take a quick second just to introduce ourselves, our joint companies and the partnership that is the exciting genesis for this webinar. My name is Alex Gurevich; I’m the Chief Technology Officer for Analytics in the Americas. I represent DXC Technology, a company that you may not be familiar with by its current labeling. It’s a new company, but not born yesterday, as we say: it is the genesis of the merger between Hewlett Packard Enterprise Services and CSC, formed almost two years ago now, and it is the third-largest services company in the world today, so a lot of the work that we do brings IT services and digital transformation capabilities to our customers across the world. Clark.

[Clark Bradley] Thanks Alex. My name is Clark Bradley; I’m a pre-sales solutions engineer with Zaloni. Zaloni also was born out of services in the big data era, and then productized all of our learnings there. We like to look at this around three different pillars: enabling the data lake with different capabilities from ingestion to transformation; governing the data platform through tight integration with security and authorization, but also securing that data; and finally engaging with the business through a variety of self-service tasks that make getting access to the data easy. All of those capabilities help companies modernize their approach and operationalize their business processes. So for DXC and Zaloni, this just feels like a very natural fit for our joint customers to be able to optimize the value of their data and provide a clear path to digital transformation.

[Alex Gurevich] So one of the things that DXC and Zaloni have done, as part of a broader offering by DXC for their analytics and AI platform, where we deliver platform services and the supporting services end to end, from consulting to design, development and deployment, as well as the run-and-operate services: we have aligned our platform services to include Zaloni as one of the core components of our platform environment that really focuses on enhanced data management and self-service. Some of the things that Clark just mentioned are what Zaloni is really focused on and best in breed at, and as part of the DXC capability we wrap a lot of the services and capabilities that Zaloni brings with the actual environment management, whether on premise or in the cloud, as well as the new-age data lake constructs that are supported by the Hadoop technologies and cloud-native deployments within public clouds as well as private clouds and hybrid deployments. And then of course expanding that to the consumption and usage of that data by BI systems, integrations with operational production implementations, as well as the application development that sits on and leverages the advanced analytics and artificial intelligence derived from the data being consumed.

[Clark Bradley] Yeah, and I think that’s where Zaloni fits perfectly in with the types of services and solutions that DXC is providing: we’re right there in the center, just above the data layer and in between the application layer, and what we provide is a management and governance layer that allows users to automate and collaborate across their data. Some of the key benefits there that I would point out: number one is integrated metadata. There are a lot of different varieties of catalogs available on the market, but what we found our customers need is an actionable data catalog, and for that you’ve got to be able to integrate the business, the technical and the operational metadata. The business and the technical metadata allow the IT and business folks to better contextualize and understand the data through the discovery process, and then operationalizing the metadata coming out of different transformation, preparation and provisioning tasks gives a clear line of sight of where the data started and where it finished up.
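To make the idea of an "actionable" catalog concrete, here is a minimal sketch of a catalog entry that unifies the three metadata types described above. The class name, fields, and lineage representation are illustrative assumptions for this example, not Zaloni's actual schema or API:

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """Hypothetical catalog entry unifying business, technical and operational metadata."""
    name: str
    business: dict = field(default_factory=dict)     # owner, glossary terms, sensitivity
    technical: dict = field(default_factory=dict)    # schema, format, storage location
    operational: dict = field(default_factory=dict)  # lineage, last run, row counts

    def lineage(self) -> list:
        """Clear line of sight: where the data started and where it finished up."""
        return self.operational.get("lineage", [])

entry = CatalogEntry(
    name="customer_accounts",
    business={"owner": "Retail Banking", "sensitivity": "PII"},
    technical={"format": "parquet", "location": "s3://lake/raw/accounts/"},
    operational={"lineage": ["crm_export", "raw_zone", "trusted_zone"]},
)
print(entry.lineage())  # ['crm_export', 'raw_zone', 'trusted_zone']
```

The point of combining all three views in one record is that a user who discovers a dataset can immediately see both its business context and what has operationally happened to it.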

The second point is around providing a simplified set of tools and automation. These tasks can take weeks, if not sometimes months; just something as simple as accessing data can be a lot more onerous than one would hope for. But for our customers and the organizations we’ve worked with, the ability to seamlessly integrate the hydration, the profiling and the gathering of descriptive statistics on the data, so that understanding starts as early as possible, can really help kick-start many organizations’ data lake projects, large and small, much faster than trying to pull these pieces and parts together alone.

[Alex] The whole concept here is that although we are obviously DXC and Zaloni, the platform and services that were built out are representative of what we believe are the best practices for building solutions in the modern digital transformation age. The end-to-end capability is critical to managing the platform deployments, and the cost to market associated with that is the result of optimizing the tool sets and the capabilities that those tools represent in an integrated fashion. So the idea is one simplified, out-of-the-box solution that can be deployed across multiple environments, hybrid especially, but optimized also for cloud. Very importantly, in the transformation many of our customers are migrating from one environment to another, adopting new environments as an extension, and in some cases transitioning towards changing their operations to be optimized across the new environment and a new paradigm of operations, including constructs like the data lake as opposed to a traditional data warehouse approach. Beyond the things that we’ve already built out, we wanted to give a quick heads-up on some of the things we are moving towards, some of which are just around the corner. So as we move to the next slide we can talk about some of the innovations and enhancements that are constantly evolving, and the direction they’re headed.

[Clark] Yeah, that’s a good handoff there, Alex. We’re working across three key trends over the short term here, and we label these Connect versus Collect, Personas Rule, and Data Intelligence. Connect versus Collect is all about the discovery of data: really helping users to understand the data, versus just having larger and larger quantities of data. For this, being able to split processing seamlessly across different cloud environments, to catalog file systems, applications and RDBMSs, and to really understand the data through both previews and data ingestion helps users get that better understanding of their data. We’ll be adding additional seamless capabilities around auto-scaling of Spark clusters, so as different data pipelines and flows get built out, being able to automatically scale in real time, say from two to 200 nodes, or to scale back when nighttime processing doesn’t need as much, is going to be a key component there. There’s also data access and federation: being able to leverage that catalog to seamlessly connect and relate data sets across a number of different environments is another key feature that really helps users understand data that’s been siloed, either due to different business areas or due to merger and acquisition, and to gather that data together to determine its usage for a business task they’re working on.
Under Personas Rule, this is more aimed towards the diversity of roles that we see across environments. You’ve got data engineers; you’ve got data scientists; a new role that popped up a few years ago, citizen data scientists, which are business analysts leveraging more predictive analytics in an automated way to take advantage of the skills they have; data stewards on the data governance front; data analysts. So there are lots of different roles, and for them to achieve what they’re working on, having different tools that don’t quite connect, that don’t quite share metadata, can really create gridlock in the business process being worked on. So we’re going to be aiming to further add end-to-end capabilities for each persona. Being able to connect to something like Jupyter notebooks to seamlessly pull in Python or R code and operationalize it through a data pipeline is a good use case there. Furthering self-service data ingestion and publishing: this is where users can bring the skills they have, rather than having to learn a new skill to get up and running in a particular environment, through wizard-based or drag-and-drop interfaces that let users take what they know, the business requirements, and apply it without having to have the technical know-how. Advancing our global search and catalog capabilities further adds metadata enrichment for an easy-to-find experience. So as users go into an environment, rather than having to scan across different catalogs, or, if you’re a data engineer, look deep into a transformation and workload, leveraging the global search will allow them to pull out the pieces and parts that take them directly to the task or the particular data item they need. And then finally, extensible user actions: going back to the personas, data scientists want to do more code-based activities, while business analysts need more activities at the UI or visualization level.
So we’re going to allow them to extend those actions out as customized actions, so they can build them onto the application for whatever their needs are. The last piece of our theme is Data Intelligence, which is a really exciting area for data management. For folks working in analytics, predictive analytics and machine learning are nothing new, but in the data management space, being able to take the information we have today, the workflows that have been built, the different tasks that have been operationalized, along with the descriptive statistics we receive from profiles of different data sets and the knowledge we can pull from the systems those datasets currently reside in, is going to allow us to make smart suggestions. We’ll be able to leverage machine learning capabilities to take the information we have, look across the history of activities occurring in the environment, and help users get a kick-start as they onboard into an environment by making those smart suggestions. This will also help with duplicate ingestion forensics: we hopefully won’t have to wait until the data lands to know that it’s duplicated and we don’t need it; we’ll be able to look across the profiles and say, hey, I don’t think we want that data here. And then finally, as we described earlier under Personas Rule, the data science notebook integration. In today’s world we have more custom-written transformations that we can operationalize, any type of procedural code that will land in the data lake.
And we want to make that more of a seamless experience, so that as users are working with their favorite coding tool sets, they’ll be able to connect a module into our environment, where we will seamlessly add it into a workflow and then operationalize it, schedule it, or have it event-based to kick off later on, so that work by the citizen data scientists and the data scientists can easily be scheduled in the environment.
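The pattern just described, taking notebook-derived code and making it a scheduled or event-triggered pipeline step, can be sketched in a few lines. This is a hedged illustration of the general idea only; the registry, decorator, and event names here are invented for the example and are not a real Zaloni or DXC API:

```python
from typing import Callable, Dict

# Hypothetical step registry: maps an event name to the function that should run.
STEPS: Dict[str, Callable[[dict], dict]] = {}

def pipeline_step(trigger: str):
    """Register a function to run when the named event fires (illustrative only)."""
    def register(fn):
        STEPS[trigger] = fn
        return fn
    return register

@pipeline_step(trigger="data_landed")
def score_customers(ctx: dict) -> dict:
    # Imagine this body was lifted from a data scientist's Jupyter notebook.
    ctx["scored"] = [round(x * 0.9, 2) for x in ctx["values"]]
    return ctx

def fire(event: str, ctx: dict) -> dict:
    """Event-based kickoff: run whatever step is registered for this event."""
    return STEPS[event](ctx)

result = fire("data_landed", {"values": [10, 20]})
print(result["scored"])  # [9.0, 18.0]
```

A real orchestrator would add scheduling, retries, and lineage capture around this core, but the shape, registering user code against a trigger, is the same.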

[Alex] Perfect. So let’s just jump right into some of the use case patterns we want to talk about, because the things we’re describing will come up over and over again in context, but we want to take the time to walk you through what we see as the core use case patterns that most of our customers struggle with. And even though there are different ways to address many of these, we’ve created three buckets. The hybrid data lake is essentially the idea that you’re migrating from, typically, an on-premise environment to, let’s say, a cloud environment, and either for the duration of that project, or perhaps as a permanent state, you will wind up maintaining both of those environments to operate against, with the challenges associated with that. The enterprise data warehouse to data lake addresses the paradigm shift of moving from a traditional structure-first, then-load, then-use approach of a data warehouse, mostly organized around traditional structured data, towards the paradigm of a data lake, as well as the challenge, often faced now, of adopting cloud in that same context. And then the Greenfield cloud environment is embracing the new digital capabilities of the cloud and some of these new technologies, and building new capabilities in those environments in the context of other data sources and capabilities that may come from outside of that Greenfield environment. So one of the things we wanted to do is take a poll: which of these patterns describes your most critical focus? As we continue talking, we invite folks to trigger off their responses; we’ll take a quick minute to let you do that. And while we do that: Clark, anything you want to add about these three patterns in terms of what you’re seeing in the marketplace and how Zaloni has reacted to it?

[Clark] Yeah, these are good ones. We’ve seen over the past couple of years that personal data security has definitely become a trigger for a lot of these movements to the cloud, whether you’re talking about a hybrid environment where you’re straddling your on-premise environment and a new cloud environment, or you’re trying to move wholesale to the cloud, or you’re brand new to the cloud and don’t even know where you want to start. You might be a small business unit, and a great place to start there is with Database as a Service and some small DevOps-type tasks against that data. These are the ones we predominantly see out there, and they all bring their own set of challenges to the game.

[Alex] Excellent. And it looks like, based on the participation, as expected, the order of these use cases is approximately how we see them positioned in the marketplace. Most folks, whether by choice or by compliance, are embracing a hybrid data lake kind of strategy; many of them are moving towards the specifics of the traditional warehouse to data lake migration; and a few have the ability and, frankly, the benefit of starting with a Greenfield, without necessarily having to worry about some of the legacy transitions of previous deployments. And we see a couple of others coming in there, so from the audience, if you want to put in some feedback on what those others are, we’d love to have a conversation around it. Great.

So let’s jump into a little more discussion on the first use case. The idea here is that the use case pattern of the hybrid data lake really addresses two things. One is of course the idea that there is an on-premise versus a cloud environment, so the paradigm shift of adopting a cloud operation; and then of course the challenges of technology, as well as the physics of connecting to the cloud and adopting some of the different technologies that are unique to the cloud, or that maybe don’t transfer well from on premise, like an appliance-based solution that many customers have in place today, whether it’s a traditional Teradata or Netezza or Exadata appliance, which did a great job for what they were targeting to do at the time. And then moving to either similar types of workloads in the cloud, or moving towards a transition to a data lake that may have been deployed as a Hadoop data lake on premise, moving that environment to the cloud, and interacting across those technologies and environments. So one of the things we very much see is a gap in metadata management. And again, for DXC this was a clear opportunity to partner with a best-in-class partner in Zaloni, because we believe very strongly that it’s part of the answer, maybe a significant part of the answer, towards democratizing data access, managing it, and simplifying it towards the goal of a self-service, single-pane-of-glass kind of operation: one that allows users to know where the data is, access it, and make sure they access it within governance criteria that are built in up front, as opposed to bolted on after the fact, and do it in a way that essentially minimizes the complexities of where such environments operate, whether on premise or in a cloud architecture.

[Clark] Yeah, I mean that last bit you said is probably the key bit: wherever they are operating, because that’s really not important to the business user. They want to get things like scalability and performance out of it, but not be bogged down with having to deal with different types of data formats and different processing paradigms, distributed or sequential, in different languages, and things that can really hamper development, because at the end of the day we’re trying to get to business insights. So some data architecture best practices there are around knowing your data. It wasn’t that long ago that the data lake was essentially defined as Hadoop, and then we saw different distributions come up, and those distributions formed their own formatting of data and their own processing paradigms to go with those environments. It was never exclusionary to the databases; they were always included, and some form of integration between the data warehouse and the data lake was needed in order to make these environments work. And as we move into cloud, that data lake term gets even more broad, as we include Hadoop and RDBMS and files and streams, S3 and Azure Blob, and data as a service like Snowflake and Redshift. There’s a variety of different pieces and parts, and so it’s important to know your data and to have proper context. That’s where a catalog is extremely important, an actionable data catalog, which I’ll explain in a minute. A catalog is important to give the different roles we talked about earlier, the data scientists, the data analysts, business analysts, data stewards and data engineers, their proper context of the data for understanding it. But more importantly, we need to be able to take action on that data: just having a reference of the data is certainly important, but what are we going to do with the data?
Where is it going? What is it supporting: static reporting, predictive analytics, or action off of some stream of data where we need to make a decision very quickly, in real time? All those different pieces and parts are important, and understanding that data is what’s going to help lead to action, whether we’re applying data quality, securing personal data with masking and transformations and tokenization, or providing preparation, in the context of a pipeline for a data engineer or a wizard-driven interface for self service for a more analyst-type role. Finally, another data architecture best practice here is around the governance of the data. As we’re moving to a cloud environment, being able to support a hybrid cloud environment, with governance over both environments, on premise and in the cloud, is important, because users need to be able to log in and perform some level of work, but they might need to transfer or split that work between the two environments. So this goes back to the catalog: having that unified vision of all the data in both environments and being able to take action off of it. Let’s say we want to secure or merge some data in the on-premise environment before we bring it over to the hybrid environment to support a report or visualization; users need a tool set that lets them leverage the best parts of both environments, while still obfuscating the more complex aspects of data processing and data formatting.
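The masking and tokenization mentioned above can be illustrated with a small, hedged sketch: tokenize an account number deterministically (so joins across datasets still work) and mask all but the last four digits for display. This is a teaching example; a production deployment would use keyed tokenization (for example, HMAC with a vault-managed secret) rather than the bare salted hash shown here:

```python
import hashlib

SECRET = b"demo-only-salt"  # illustrative; never hard-code secrets in practice

def tokenize(value: str) -> str:
    """Deterministic token: the same input always yields the same token,
    so tokenized columns can still be joined across datasets."""
    return hashlib.sha256(SECRET + value.encode()).hexdigest()[:16]

def mask(value: str) -> str:
    """Display masking: keep only the last four characters visible."""
    return "*" * (len(value) - 4) + value[-4:]

acct = "4111111111111111"
print(mask(acct))                        # ************1111
print(tokenize(acct) == tokenize(acct))  # True: stable token preserves joins
```

Applying `tokenize` in the on-premise zone before data moves to the cloud is one way to do the "secure before you bring it over" step described in the transcript.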

[Alex] I think the whole idea here is that the goal has to be business-oriented: these are technology solutions to business problems and business challenges. There was a question that came across the system about the criteria for choosing what type of cloud solution or what kind of pattern to use. In general, I would say all of these have to reflect your business goals: how are you going to measure success based on business outcomes, and then ensure those successes can be executed in a simple, cost-effective and operationally efficient way? These are patterns we find address those challenges, in this case the hybrid data lake. As we move on to the next slide and show an example of how something like this fits in a specific deployment: we’re using an Azure example here. This could be AWS; it could be Google. DXC has patterns of deployment for those platform components, and we can certainly help identify which components are necessary for which type of use case, whether it’s deep learning and algorithmic advanced analytics, traditional BI, or any hybrid version thereof. But in the end, what’s common across most customers is an on-premise version of a data lake that may actually, quite bluntly, be more of a data swamp: it’s been hydrated but maybe not optimized very well for usage, and folks are having trouble extracting content and monetizing the value that’s in the data, so they’re not quite using it as an asset yet. Then there’s embracing the flexibility of the cloud as part of the digital transformation. But really the big defining change is the adoption of that data catalog, and the idea that you are exposing the data to your user community, creating the automation that simplifies and optimizes the data pipelines associated with the processes behind the scenes.
And then facilitating the use and the embedded application of that data towards business value. Clark?

[Clark] Yeah. And so this architecture here reflects a sample use case that we have in the banking industry, and all these pieces and parts are largely optional; they’re here to reflect the business case but aren’t required components, so I want to make sure I point that out. In our environment, for Zaloni, as far as processing, we leverage Spark most of the time, about 90% of the time it’s Spark, with some MapReduce and some Hive being used in Hadoop environments. And so what we see here is that Zaloni could be installed either on premise, accessing the on-premise Hadoop cluster directly and the Azure environment remotely, or it could be installed in Azure and go back the other way through a virtual private network to the on-premise environment. That architecture is usually recommended based on locality of data access, security, scalability and some other factors we work with customers on. But largely what you can take away from this example is the business case: the bank wants to discover new opportunities in their data. This goes across all their financial data, mortgages, loans, credit card information, bank records and things like that, so they want to be able to see things like: this customer has a mortgage, so offer them a new line of credit, or maybe offer them financial planning if they hit a certain threshold. Their technical challenges here were largely around scalability: they had basically gone as far as they wanted to go with investments in their existing infrastructure. So being able to save money on the cost of expanding infrastructure and get cheaper storage in the cloud, as well as the locality of being able to regionalize different reporting and visualization automatically by landing that data in different regions of the cloud, made a lot of sense to them.
So here what we do is work to create, like we talked about earlier, that one understanding of all the data across the environments, so we get a clear picture of what’s on premise and what state it is in: is it raw data, has it been transformed and trusted into enterprise data sets, has it been refined for a business case? And as the data migrates over into the Azure data cloud, what does it represent there? So we follow a full zone-based architecture, which is a data architecture best practice, through the Azure Data Lake storage system to transition that data from its raw state all the way through to its business case, or even a sandbox area for development.
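The zone-based architecture described here, data moving from a raw state through a trusted state to a refined, business-ready state, can be sketched as a simple promotion model. The zone names follow a common data lake convention and the structure is illustrative, not a prescribed Zaloni layout:

```python
# Assumed zone convention: raw -> trusted -> refined, with each promotion
# recorded so the catalog keeps a clear line of sight (lineage) for audit.
ZONES = ["raw", "trusted", "refined"]

def promote(asset: dict) -> dict:
    """Move an asset to the next zone, recording the transition in its history."""
    i = ZONES.index(asset["zone"])
    if i == len(ZONES) - 1:
        raise ValueError(f"{asset['name']} is already in the final zone")
    asset["history"].append(asset["zone"])
    asset["zone"] = ZONES[i + 1]
    return asset

asset = {"name": "mortgages", "zone": "raw", "history": []}
promote(asset)  # raw -> trusted: cleansed, standardized, quality-checked
promote(asset)  # trusted -> refined: shaped for the specific business case
print(asset["zone"], asset["history"])  # refined ['raw', 'trusted']
```

The value of making promotion explicit like this is exactly the point Clark raises: at any moment, both IT and business users can answer "what state is this data in, and how did it get there?"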

[Alex] One of the things you obviously struggle with as you transition into this is typically the question of which tools and which environments, set up in what way, with what kind of management around them. How do you select the right resources within Azure to execute on? How do you manage and monitor all of these things? This is a level of managed services that is really key, whether you’re developing this in house, which usually takes quite a ramp-up, or even more so if you are outsourcing this as a managed service. This is where DXC has built a lot of expertise and services that help you: in the same way that you are simplifying the use of data, we are extending that to the simplification of these platform deployments. For example, one of the services we have is an Azure migration factory, where there’s expertise about the on-premise environments you have currently, as well as expertise in the target environments, and the operationalizing of that migration through repeated blueprints and best practices that have been validated and proven. The sample architecture, as Clark said, has some components that may be optional for your particular use case, but this is an example of an actual client solution that has been deployed and implemented in this particular way. And many of those consulting and advisory components are services that can be brought in from DXC and Zaloni to support you; if not from us, then I urge you to bring in a partner, an SI, or an expert from the vendor to support you in this, because it is quite a ramp-up, and this is where building on success is critical as you move on and explore further uses of these environments.

So with that, maybe we can continue the same discussion in the context of the next use case pattern and move on to the EDW transition use case. The key here is that this is the typical use case where you may never have had a Hadoop cluster in place, or that’s not what you’re trying to transition from. Today you have an EDW, whether again the classic appliance-based deployments or some of the more modern MPP solutions, maybe distributed compute, but still on premise, still leveraging the traditional paradigm: pre-schemaed data, ETL processes typically outside the environment, moving the data into a pre-structured environment and then exposing that environment through traditional query-based and BI tool sets. As long as that’s serving your business needs today, there’s nothing wrong with it; however, what we’re seeing is that the new technologies offer new paradigms of usage patterns that are more conducive to the data lake approach of store once, use many times by many different users, with either multiple schemas or perhaps even schema on demand. The idea is that you transition from pre-structuring, pre-harmonizing and pre-processing the data to keeping the data as whole and as original as possible, and then make the changes to suit the specific business needs of the different business communities, which, especially in analytics and AI, are more conducive to as much raw data access as possible. And then, as you’re moving into these environments, you’re also adopting the construct of clouds to expand the flexibility, scalability and elasticity of some of the cloud paradigms. The transition to that paradigm of a data lake is a challenge of its own; the transition of on premise to cloud is yet another paradigm shift, and this is a cultural change for many companies as well as a technological change. Clark?

[Clark] Yeah, I agree with all those challenges; we see quite a few of them. From a recommendation standpoint, what we work with different organizations on is allowing them to bridge those silos coming out of the enterprise. There are lots of tools, and as new capabilities become available, adding them onto the existing infrastructure can sometimes be harder than just using the new tool itself, because the metadata doesn’t integrate and it doesn’t provide a clear path to auditability of the processing taking place in the environment. It can be difficult even though it adds new capabilities. So it again leads back to having that unified metadata management, providing a single actionable catalog that allows users to find all their data and have clear visibility of it. Leaning towards that is having a marketplace experience for business users and their data: rather than having to, in the most bureaucratic case, fill out forms and submissions to get data added to a common place, being able to just go look at, access and touch the data. Having an instant view of that data in a shopping-cart experience, where you can select different assets and take them, whether it’s unstructured or structured data, to the specific spots that will optimize the processing experience. That optimization of transformation is another key recommendation when transitioning from the EDW: as the EDW gets built up more and more, and those transformations flow down, it gets harder and harder to meet SLAs as data volumes grow. And there might not be, like Alex said earlier, Hadoop infrastructure in place to support unstructured data.
That's just more infrastructure that needs to be added. So being able to take advantage of distributed processing in a cloud environment that is scalable and less costly makes a lot of sense for customers: they can parallelize their data loads and take advantage of a zone-based architecture so that, on the IT side, they better understand the current state of data for access by users, and on the user side, they know which data is trusted and can be leveraged for different business insights. Another recommendation I wanted to—

[Alex] The one thing I want to highlight is a pattern we're seeing with users at enterprises now: the definition of production has been stretched and expanded from the traditional one. Many of these environments are exposing and joining data across sources that are not traditional to what we'd call legacy production operations. That includes different formats of data, so unstructured formats, whether it's sensor logs from your IoT environments or IP logs from your back-end system operations, but also things like weather data, social data, perhaps even video content. All of that becomes usable by the enterprise in production-grade usage, but not necessarily embedded into your operations holistically. There are data scientists who are exploring and evaluating data, and they need high-quality, enterprise-grade environments; then, based on their operational usage of that data and proving out its value, the result can be embedded back into your systems. To enable all of those different usage patterns and users across your deployments within the enterprise, you have to have different sandboxes and different sets of governance flexibility, and that is where the power of the cloud really comes into play: you can spin up these environments, you can feed additional resources into them, and you can still expose them through that common data catalog and data democratization that surfaces the viable usage of all the data assets you have in place across the different patterns of production-grade deployments.

[Clark] Yeah, totally agree. Just to sum up those last two points, it's all about automation and collaboration. For collaboration, organizations need a self-service strategy as part of their data strategy, to enable business users who don't necessarily have the skills to build pipelines and procedural code to work within the environment seamlessly. And on the IT side, being able to move storage from hot to warm to cold, so they can save on infrastructure cost, is a data architecture best practice.
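The hot/warm/cold tiering Clark mentions can be sketched as a simple policy function. This is a minimal illustration, not anything from the webinar: the thresholds (30 and 180 days) and function names are hypothetical, and a real deployment would typically express this as a storage lifecycle rule rather than application code.

```python
from datetime import date, timedelta

# Hypothetical thresholds, in days since last access; tune per workload.
TIER_THRESHOLDS = [("hot", 30), ("warm", 180)]

def storage_tier(last_accessed: date, today: date) -> str:
    """Classify a dataset into hot/warm/cold storage by access recency."""
    age = (today - last_accessed).days
    for tier, max_age in TIER_THRESHOLDS:
        if age <= max_age:
            return tier
    return "cold"

today = date(2019, 3, 13)
print(storage_tier(today - timedelta(days=7), today))    # hot
print(storage_tier(today - timedelta(days=90), today))   # warm
print(storage_tier(today - timedelta(days=365), today))  # cold
```

The same decision is what a lifecycle policy on an object store encodes declaratively; the point is that the tiering rule lives in one place and runs automatically, rather than being a manual IT task.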

[Alex] Yeah, so let's jump into the architectural review of this use case. A lot of the value in all of these use cases is the automation and the simplification of these deployments. In the past, sources fed an EDW through multiple separate tool sets and different environments, with maybe even separate metadata management, if any: very disparate processes that you had to become very familiar with and, in many cases, have the right security access to. The idea that all of these things become a bit more automated and simplified for the end user, including IT management of these environments, is one of the keys to success. Focusing on that simplification and automation up front saves the effort of automating it later, as does adopting tool sets that are conducive to it, especially tool sets that give you a single pane of glass for developing those pipelines. That automation, with auto-generation of metadata upon ingestion, really gives you flexibility, and that becomes one of the value propositions and criteria for selecting tools. Yes, you can use an existing IT tool set for ETL, but it doesn't give you the value-add of data lineage across all the different environments. And can you pass that lineage into a searchable, accessible data catalog that users can use to understand the impact on their downstream application development, or simply how their data correlates to other data in the environment from different sources?
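To make the "auto-generation of metadata upon ingestion" concrete, here is a minimal sketch. The catalog here is just an in-memory dict and every name (`ingest`, `CATALOG`, the sample dataset) is hypothetical; a real platform such as Zaloni's would persist these entries and link lineage across pipelines.

```python
import csv
import hashlib
import io
from datetime import datetime, timezone

CATALOG = {}  # dataset name -> auto-generated metadata entry

def ingest(name, raw_csv, source):
    """Ingest a CSV payload and auto-generate catalog metadata for it."""
    reader = csv.reader(io.StringIO(raw_csv))
    header = next(reader)
    rows = list(reader)
    entry = {
        "schema": header,                                          # inferred column names
        "row_count": len(rows),
        "checksum": hashlib.sha256(raw_csv.encode()).hexdigest()[:12],
        "source": source,                                          # lineage: where it came from
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }
    CATALOG[name] = entry
    return entry

meta = ingest("shipments", "id,route,weight\n1,ATL-ORD,120\n2,ORD-SEA,80\n", "partner_sftp")
print(meta["schema"], meta["row_count"])  # ['id', 'route', 'weight'] 2
```

Because the schema, checksum, source, and timestamp are captured at the moment of ingestion, the catalog entry exists before any user touches the data, which is what makes the search-and-lineage experience possible downstream.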

[Clark] Yeah, that was a great overview. This comes from a customer in the transportation industry that was looking to optimize its supply chain. They get a variety of data: structured, unstructured, and semi-structured, coming from partners, joint ventures, and third parties, plus some existing bring-your-own data to help enrich it. Having a flexible, scalable environment that can work with multiple formats of data was exactly what they needed to support their business process.

[Alex] Yeah. One thing that is also important, especially when we talk about migrations, is an understanding of the source system, its technology, and the expertise in that environment as you migrate to the new one. For example, if you're migrating from a Teradata environment, understanding what's unique about Teradata is very important as you choose among strategies: lift and shift, redesign, or lift and shift and then redesign. In some cases, moving the processes as-is is viable, but does it offer you the benefit of changing your business operations, or are you just changing one horse for another? Does it offer you better cost efficiencies and better scalability? In many cases it is just as important to pull the value of redesign up into your process as it is to get the data out into the new environment. Now, we're coming up on time and want to leave room for some Q&A, so let's jump into the latter use case. In this example, you may have a specific business unit that's able to do something new, or you may have a new company that is, call it, born in the cloud to begin with. Whatever the reason, in this case you are not burdened as much by migration as by the deployment, adoption, and leverage of a cloud-native architecture. The biggest benefit here is a really fast time-to-market deployment, but the challenge is that you can still create just as many complications as in a migration by not adopting the best practices. Again, metadata management through catalog deployments, and an end-to-end service layer that allows the target environment to be managed, monitored, and optimized elastically, are very key. So although this looks like an easy use-case pattern to deploy, which it can be,
it carries with it many of the same liabilities and complexities as a traditional transformation.

[Clark] Yes, as Alex says, we often see those challenges with a new-state cloud. What a lot of organizations get overwhelmed by is that this is all brand new and different from what they had before, but in many ways it's better. I'm going to start with the last part first and work my way up. Having experience can really help here: taking an existing data source or format, an existing data structure, architecture, or pipeline, and being able to revitalize it in the cloud is important, because there's a lot to take advantage of that can aid in automation. Where data patterns are understood, executions that previously might have been manual can be automated, which helps jump-start more complex tasks like operationalizing data science. Another recommendation here is integrating business insights. The catalog seems to be the word of the day, but it's extremely important to have an actionable catalog to help understand the data and make quick use of it. Taking the existing work that's been done and adding it into modern processing, which means scalability and distributed processing, while also providing business context, allows users to very quickly access the data and start working with it. That's one of the key areas where we see the largest lag in time: analysts tell us that 80% of data scientists' time is spent preparing the data, and the other 20% is spent complaining about the 80%, or something like that. The goal here is to speed up that data preparation time, and by having that understanding and being able to automate things, we get there much quicker.

[Alex] Right. Let's go to the architecture deployment example of this. What you see here is a very similar pattern of deployment, but in an AWS environment. Architecturally the principles are the same: you still have that overall platform to manage services across the deployment, and you operationalize the data catalog within it, but in the context of, in this case, AWS. Clark, maybe make a quick point about anything that's unique to AWS here, and then we can move to some questions and discussion.

[Clark] Yeah, I think one of the things the users online are hopefully seeing is that the Zaloni set of tasks fits in like a glove, and that's on purpose. As people transition, go hybrid, or even move between clouds, we saw a lot of customers change to new distributions, and we're seeing the same thing in the cloud: multiple clouds, or AWS today and another provider tomorrow, as they take advantage of storage costs, better processing, or new features. The tooling was written to be flexible in its incorporation with different environments. What you see here is our zone-based architecture, in this example for a customer in the publishing industry, allowing them to migrate their data from landing to raw to sandbox through S3 buckets. But the storage system just depends; it can be any storage system that AWS supports, and we integrate with it. We're not the data storage system, just as we're not the application layer, the BI, or the development application layer; we sit right in between to manage and govern the data as it enters the environment and gets prepped for business usage.

[Alex] Yep, and to that point, one of the things that DXC Platform Managed Services allows you to do is understand what the best storage environment is.
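The landing → raw → sandbox zone flow described above amounts to a naming convention over object storage. The sketch below illustrates one possible layout; the zone names follow the webinar, but the key structure, date partitioning, and function name are assumptions for illustration only.

```python
# Zones in the order data flows through them: raw arrivals land first,
# validated data moves to "raw", and user workspaces live in "sandbox".
ZONES = ("landing", "raw", "sandbox")

def zone_key(zone, dataset, ingest_date, filename):
    """Build an S3-style object key for a file in a given zone.

    Example layout (hypothetical): <zone>/<dataset>/dt=<date>/<file>
    """
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return f"{zone}/{dataset}/dt={ingest_date}/{filename}"

print(zone_key("landing", "articles", "2019-03-13", "batch1.csv"))
# landing/articles/dt=2019-03-13/batch1.csv
```

Because the zone is encoded in the key prefix, access policies and lifecycle rules can be attached per zone, which is what lets IT govern each stage of the data's journey independently.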

[Alex] And as Clark pointed out, many of our clients, even after their first selection and first experiences, as their business changes, are able to flex into the new environments that AWS, Azure, or any of the cloud providers offer, with a lot more flexibility and speed than traditional capex-type environments. Maybe one thing we can talk about: we have questions here on how this type of solution addresses some of the business challenges many of our clients are experiencing, constructs like GDPR, data protection, and some of the governance and compliance challenges we're seeing in relation to data management in general.

[Clark] That's a great use case. I would say that one of the industries where we're strongest is finance, which has some of the most rigid compliance and regulatory requirements of any industry. That touches a lot of different capabilities. You've got to be able to identify PII data: you've got to know where it is, be able to find it, and validate it. We use data quality rules integrated into the processing application, and rather than bringing the data to us, we push the application processing to the data for efficiency, to identify whether it is sensitive data, like a credit card number or a social security number. Then we provide both masking and tokenization. Masking is the less rigid form; users might use it to hide, say, a partial birth date, where we want to expose the year for some analytic segmentation down the road but obscure the full date, rather than completely transforming it. If we need to be more rigid and hide a social security number, masking isn't as helpful, because we lose the cardinality: segmenting on a masked social security number becomes much harder, and the last four digits alone are not very useful. So we provide multiple ways to identify and secure that data, and we work to incorporate the different encryption algorithms and patterns available in the environment for system storage, so the data is fully secure. Finally, there's the reporting of it: for GDPR compliance, or any type of regulation, you've got to be able to say where the data came from, how you changed it, and where it wound up, and we have full lineage available in the Zaloni tooling to trace that completely.
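The masking-versus-tokenization distinction Clark draws can be shown in a few lines. This is a simplified sketch: the function names are invented, and the salted-hash token stands in for what a production system would do with a token vault or format-preserving encryption.

```python
import hashlib

def mask_birthdate(iso_date):
    """Masking: hide month and day but keep the year for segmentation."""
    return iso_date[:4] + "-**-**"

def tokenize_ssn(ssn, salt="demo-salt"):  # salt is a placeholder for illustration
    """Tokenization: a deterministic surrogate that preserves cardinality,
    so equal inputs map to equal tokens and joins/group-bys still work."""
    return hashlib.sha256((salt + ssn).encode()).hexdigest()[:16]

print(mask_birthdate("1984-07-22"))                               # 1984-**-**
print(tokenize_ssn("123-45-6789") == tokenize_ssn("123-45-6789")) # True
print(tokenize_ssn("123-45-6789") == tokenize_ssn("987-65-4321")) # False
```

The masked birth date keeps only the attribute an analyst needs (the year), while the tokenized SSN keeps none of the original digits yet still distinguishes one person from another, which is exactly the cardinality property masking gives up.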

[Alex] Yeah, and the idea here is that, especially with GDPR, where noncompliance carries a penalty, there's a heavy reliance on monitoring capabilities: that lineage, that transparency, but also the ability to guide your users, to ensure they are educated in and protected by how they use the data, as your company policies dictate. That common window pane of access lets you not only create secure limitations on access but also provide guidance on how best to make use of the data. One of the things I really like about the combination in the DXC and Zaloni partnership is that it exposes control to a broader user community, so you have things like collaborative, crowdsourced contributions to metadata. Curation is no longer an IT problem, or an individual's or a small group's problem; it is a capability the entire enterprise can participate in. And that, I think, is the overall goal, and maybe even the closing message: the goal for these types of distributed platforms and implementation solutions is to get the entire enterprise engaged, as opposed to letting the enterprise be governed by the limitations of any one division of it, whether that's IT or even business units that may have only partial visibility or knowledge of the data or the technologies. This is the first time we're seeing the ability to truly engage the entire enterprise and get them participating in operating the technology, because it no longer takes a PhD or an electrical engineer to navigate the infrastructure and, as Clark pointed out, make the data actionable. Otherwise, knowledge by itself is not really useful.
You have to be able to make decisions on it that can be acted upon; that action has to be valuable to the enterprise, and it has to be monitorable for success, so you know the action was truly valuable and changed, enhanced, or said something about the way you do business. Clark, anything to add?

[Clark] That's a great wrap-up. I think it's all about having a collaborative strategy, and that goes to the tooling: a unified environment to work in where everybody can sing from the same song sheet, working from the same application and the same data, but also able to hand off tasks. Everybody brings different skills to the game here. We're beyond the days where filling out requirements, handing them to IT, and having a project plan is enough to get us to insights quickly. Everybody needs to work together with a tool set that reflects their particular skills, and having that self-service data strategy has, in my experience over the last five to seven years, been key for organizations to achieve quick insights.

Well, thanks. There's lots of information that we shared, and what I wanted to do was show you a couple of different approaches that address the same problem, and three different patterns that your organization is more than likely to fall into at least once or twice throughout its digital transformation journey. If you have more questions, we invite you to look at some of the DXC and Zaloni self-service analytics platform discussions; there's a link posted here in the end slides, and also in the attachments and links portion of this webinar. There are a couple of blogs on the topic that talk about how to prepare yourself for artificial intelligence, as well as some of the benefits and approaches to democratizing data access. And of course, there's lots of information among the Zaloni resources on their site about case studies and how the data catalog construct really changes the paradigm of operating on and navigating your data, treating it more as a data asset. Then, in the contact information, there's a DXC link and a Zaloni link for further contact with questions and discussions.