
Data Governance and Data Science, Working Together

Team Zaloni | December 14th, 2018

Data science depends heavily on automated governance processes. At the same time, data science workbenches necessarily include many manual, "think time"-intensive activities that require their own dedicated discovery sandboxes. These sandboxes offer even more value when they are easily merged into an automated end-to-end process that connects back-end governance with self-service engagement by analysts.

To further understand the benefits, let's consider governance and engagement from the standpoint of how each process affects business usage.

Governance Processes for Business Analytics

Data governance has for several years been driven by the needs of top-down analytics, where the schemas and data models were determined ahead of time and meant for historical reporting, strategic insights, and operational performance.

As such, data governance (quality, lineage, enrichment, security, etc.) has been more of a linear process pipeline with a well-known, yet often complex, path from raw to "finished good" datasets that can be released for broader use.

Data governance has never been a simple process, especially in the context of an enterprise data warehouse. Still, with the warehouse's predefined standard entities and schemas, there were upper limits on the complexity. Plus, you had benchmarks with which to measure performance and risk where data preparation and management were concerned.

Today, this linear model is struggling and must become much more interactive and iterative. The terms "data science" and "self-service analytics" are frequently used together, and for good reason: data science requires that analyst teams be able to access raw information to tag, enrich, and test it, and, once it is vetted, re-enter it into business processes for measurement and refinement.

Engagement vs Governance for Data Science

With the arrival of big data came massive volumes of "capture now, transform later" data. The governance pipeline can no longer be restricted to a predefined set of entity models, or even to the defined types of the enterprise data warehouse (EDW) and other analytics consumers.

The overall data pipeline must evolve more quickly to meet the needs of the data science teams that engage with the data. These teams collect, annotate, evaluate, and analyze highly variable data from outside sources such as consumers, customers, suppliers, and other partners. In addition, the analysts in the target audience need more control over how they access raw data and build their own semantic entities, often for ad-hoc needs unanticipated by the core IT team managing the existing EDW.

Integrating Processes to Support Modern Workflows

Given what we now know about the evolution of governance and engagement, there should be ways for the two different pipelines to be integrated, in both directions, in a controlled, traceable manner:

  • Governance: automation and visibility for quality checking, field-level privacy rules, lineage tracking, auditing/logging, provisioning, etc. (see the sketch after this list)
  • Engagement: multiple stakeholders accessing and sharing data, detailed work-in-process data formatting, vetting results, eliminating bias from the data, adding new data types, tracking the downstream impact of data models, etc.
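To make the governance side concrete, here is a minimal Python sketch of an automated governance step that applies field-level privacy masking, runs a simple quality rule, and writes an audit record. All of the names and rules here (PRIVATE_FIELDS, QUALITY_RULES, the hash-based masking scheme) are illustrative assumptions, not any particular product's API.

```python
# Illustrative governance step: mask private fields, run quality
# rules, and record an audit entry. Names and thresholds are
# assumptions for this sketch only.
import hashlib
from datetime import datetime, timezone

PRIVATE_FIELDS = {"email", "ssn"}          # fields subject to privacy rules
QUALITY_RULES = {"age": lambda v: v is not None and 0 <= v <= 120}

def govern_record(record: dict, audit_log: list) -> dict:
    """Mask private fields, validate quality rules, and log the action."""
    out = dict(record)
    for field in PRIVATE_FIELDS.intersection(out):
        # Hash rather than drop values so downstream joins remain possible.
        out[field] = hashlib.sha256(str(out[field]).encode()).hexdigest()[:12]
    failures = [f for f, rule in QUALITY_RULES.items()
                if f in out and not rule(out[f])]
    audit_log.append({
        "ts": datetime.now(timezone.utc).isoformat(),
        "masked": sorted(PRIVATE_FIELDS.intersection(record)),
        "quality_failures": failures,
    })
    return out

audit: list = []
clean = govern_record({"email": "a@b.com", "age": 34}, audit)
```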

Once the artifacts created by data scientists leave the confines of the sandbox, they need to be promoted to full membership in the integrated governance and engagement pipeline.

The workflow management needed to support this integration must be separate enough to offer flexibility, yet joined at the right times to bring data science artifacts into the overall governance process.
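As a rough illustration of that joining point, the sketch below models promotion of a sandbox artifact as an explicit, auditable state transition that carries its lineage with it. The zone names and approval fields are assumptions made for this example, not a prescribed workflow.

```python
# Hypothetical promotion of a sandbox artifact into governed zones.
# Each transition is recorded in the artifact's lineage.
from dataclasses import dataclass, field

ZONES = ["sandbox", "staging", "trusted"]   # illustrative zone names

@dataclass
class Artifact:
    name: str
    zone: str = "sandbox"
    lineage: list = field(default_factory=list)

def promote(artifact: Artifact, approver: str) -> Artifact:
    """Move an artifact one zone forward, recording who approved it."""
    idx = ZONES.index(artifact.zone)
    if idx == len(ZONES) - 1:
        raise ValueError(f"{artifact.name} is already fully governed")
    artifact.lineage.append({"from": artifact.zone,
                             "to": ZONES[idx + 1],
                             "approved_by": approver})
    artifact.zone = ZONES[idx + 1]
    return artifact

model = Artifact("churn_model_v2")
promote(model, approver="data_steward")     # sandbox -> staging
```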

Consider how these artifacts benefit from being part of a top-level governance process:

  • Training data lineage, profiling, quality, and auditing: training data, like any other entity, can have its own quality issues, such as bias (intended or not), that cause subtle machine learning problems deep down the value chain. These errors can be quite damaging to business goals and very difficult to trace back and resolve. (A minimal profiling sketch follows this list.)
  • Model and algorithm lineage, profiling, quality, and auditing: just as with other work products from data science teams, models require lineage and provenance that prevent errors from entering the data value chain. Once registered, models can be tracked and evaluated as part of the broader business analytics process.
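As one concrete example of a training data quality check, the sketch below profiles a training set for missing labels and crude class imbalance before the data enters the governed value chain. The 10% imbalance threshold is an arbitrary assumption chosen for illustration, not a standard.

```python
# Minimal, illustrative training data profile: row count, label
# distribution, missing labels, and a crude imbalance flag.
from collections import Counter

def profile_training_data(rows: list[dict], label_key: str = "label") -> dict:
    labels = Counter(r.get(label_key) for r in rows)
    total = sum(labels.values())
    report = {
        "rows": total,
        "label_distribution": dict(labels),
        "missing_labels": labels.get(None, 0),
    }
    # Flag any class that falls below 10% of rows (assumed threshold).
    report["imbalance_warning"] = any(
        c / total < 0.10 for k, c in labels.items() if k is not None
    ) if total else False
    return report

print(profile_training_data(
    [{"label": "churn"}, {"label": "stay"}, {"label": "stay"}]
))
```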

Governance and Engagement are Better Together

As mentioned, data scientists and analysts create new data entities for each analytics audience as they parse massive data streams and ask new questions of them. Data science teams not only need to work with existing entity datasets, whether in raw or work-in-process format, but also create new entities: datasets for machine learning training, new transformations for predictive processing, new algorithms, Spark notebooks, and more.

These new data entities must be seamlessly incorporated and managed as part of an overall governance pipeline, one flexible and configurable enough to treat each of these entities in the most effective way for its type.
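One way to picture such a flexible pipeline is a catalog that attaches a per-type policy to each kind of entity. The sketch below is hypothetical; the policy fields and entity types are assumptions chosen to mirror the examples above.

```python
# Hypothetical catalog with per-type governance policies, so each
# data science artifact kind can be handled individually.
ENTITY_POLICIES = {
    "training_set":   {"retention_days": 365, "requires_bias_review": True},
    "transformation": {"retention_days": 730, "requires_bias_review": False},
    "notebook":       {"retention_days": 180, "requires_bias_review": False},
    "model":          {"retention_days": 730, "requires_bias_review": True},
}

def register_entity(catalog: dict, name: str, kind: str, owner: str) -> dict:
    """Add a data science artifact to the catalog under its type's policy."""
    if kind not in ENTITY_POLICIES:
        raise ValueError(f"unknown entity type: {kind}")
    entry = {"kind": kind, "owner": owner, **ENTITY_POLICIES[kind]}
    catalog[name] = entry
    return entry

catalog: dict = {}
register_entity(catalog, "churn_training_v3", "training_set", "ds_team")
```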

Zaloni Arena DataOps Platform Accelerates Governed, Self-Service Analytics

For many customers, Zaloni Arena has provided the capabilities to support this staged pipeline, tailored to each ingested data type, through a persona-based, zone-based value chain (raw to finished goods for each target consumer), a rich transformation workflow layer, and lifecycle management based on data age and frequency of access.

Arena provides the following capabilities to merge and orchestrate data science workflows with data governance workflows:

  • A massively scalable ingest cluster that pulls in large amounts of raw data in any format, then feeds it into a zone-based architecture for managed, traceable promotion. The datasets offered in the role-based user catalog are built from trusted, enriched data targeted at each catalog project and user.
  • A powerful workflow and transformation cluster that provides a tailored, audited pipeline for each data type from raw to finished, including sending and/or receiving data science workflow artifacts as intermediate data assets. The workflow capability is open, with interfaces for REST, JDBC, Hive, SparkSQL, Python, Scala, and more.
  • A self-service provisioning shopping cart interface, so that teams can partition, filter, and provision data assets on demand. This provides a way to keep track of data science assets; via Arena's REST API, the enterprise can also plug other data workflow management products and processes into Arena as part of an end-to-end governance and engagement process (a hedged example follows this list).
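As an illustration of that last point, the snippet below sketches how a provisioning request might be driven through a REST API such as the one Arena exposes. The endpoint path, payload shape, and authentication shown here are invented for the example; consult the product documentation for the actual interface.

```python
# Hedged sketch of an on-demand provisioning call over REST.
# The /api/provision endpoint and payload fields are assumptions.
import json
import urllib.request

def provision_dataset(base_url: str, token: str, dataset_id: str,
                      target_zone: str) -> dict:
    """Request provisioning of a cataloged dataset (hypothetical API)."""
    req = urllib.request.Request(
        url=f"{base_url}/api/provision",          # assumed endpoint
        data=json.dumps({"dataset": dataset_id,
                         "zone": target_zone}).encode(),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```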

If you’ve been exploring ways to confidently provide governed, self-service data access to your business, you’ll want to chat with us about how our customers are doing this today.

Contact us for a demo of Zaloni Arena.


About the Author

This team of authors from Team Zaloni provides its expertise, best practices, tips and tricks, and use cases across varied topics including data governance, data catalog, DataOps, observability, and much more.