December 14th, 2018
Data science depends heavily on automated governance processes, yet data science workbenches also, out of necessity, include many manual, “think time” intensive processes that require their own dedicated discovery sandboxes. These discovery sandboxes offer even more value, however, when they are easily merged into an automated end-to-end process between back-end governance and self-service engagement by analysts.
To further understand the benefits, let’s consider governance and engagement from the standpoint of how each of their processes affects business usage.
Data governance has for several years been driven by the needs of top-down analytics, where the schemas and data models were determined ahead of time and meant for historical reporting, strategic insights, and operational performance.
As such, data governance (quality, lineage, enrichment, security, etc.) has been more of a linear process pipeline with a well-known, yet often complex, path from raw to “finished good” datasets that can be released for broader use.
Data governance has never been a simple process, especially in the context of an enterprise data warehouse. With its predefined standard entities and schemas, there were upper limits on the complexity. Plus, you also had benchmarks with which to measure performance and risk as far as data preparation and management were concerned.
Today, this linear model is struggling and must become much more interactive and iterative. The terms “data science” and “self-service analytics” are used together frequently. Out of necessity, data science requires that analyst teams be able to access raw information to tag, enrich, and test it and, once it is vetted, re-enter it into business processes for measurement and refinement.
With the arrival of big data came massive volumes of “capture now, transform later” data. The governance pipeline can no longer be restricted to a predefined set of entity models, or even to the defined types of the enterprise data warehouse (EDW) or other analytics consumers.
The overall data pipeline must evolve more quickly to meet the needs of the data science teams that engage with the data. These teams collect, annotate, evaluate, and analyze the highly variable data from outside sources like consumers, customers, suppliers, and other partners. In addition, analysts in the target audience need more control over how they access raw data and build their own semantic entities, often based on ad-hoc needs that are unanticipated by the core IT team managing the existing EDW.
Given what we now know about the evolution of governance and engagement, there should be ways for the two different pipelines to be integrated, in both directions, in a controlled, traceable manner.
Once the artifacts created by data scientists leave the confines of the sandbox, these new artifacts need to be promoted as full members of the integrated governance and engagement pipeline.
The workflow management needed to support this integration is different and needs to be separate enough to offer flexibility, but also needs to be joined together at the right times in order to bring data science artifacts into the overall governance process.
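To make the promotion step concrete, here is a minimal sketch of registering a vetted sandbox artifact in a governed catalog with its lineage and promotion timestamp. All names here (`Artifact`, `promote`, the field names) are illustrative assumptions, not part of any real product API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Artifact:
    """A sandbox artifact awaiting promotion (illustrative shape)."""
    name: str
    kind: str                 # e.g. "dataset", "transformation", "notebook"
    owner: str
    lineage: list = field(default_factory=list)  # upstream dataset names


@dataclass
class CatalogEntry:
    """A promoted artifact as recorded in the governed catalog."""
    artifact: Artifact
    promoted_at: str
    status: str = "governed"


def promote(artifact: Artifact, catalog: dict) -> CatalogEntry:
    """Promote a vetted sandbox artifact into the governed catalog,
    recording when it entered the governance pipeline."""
    entry = CatalogEntry(
        artifact=artifact,
        promoted_at=datetime.now(timezone.utc).isoformat(),
    )
    catalog[artifact.name] = entry
    return entry


catalog = {}
churn_input = Artifact(
    name="churn_model_input",
    kind="dataset",
    owner="analyst_team",
    lineage=["raw_events", "crm_accounts"],
)
entry = promote(churn_input, catalog)
```

The key design point is that promotion is an explicit, recorded event: the catalog captures who owns the artifact and which upstream datasets it derives from, which is what makes the hand-off from sandbox to governance traceable.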
Consider how these artifacts benefit from being part of a top-level governance process:
As mentioned, data scientists and analysts create new data entities as requested by each analytics audience as these teams parse the massive data streams and ask new questions about them. Data science teams not only need to work with existing entity datasets, whether in raw or work-in-process format, but also to create new entities: new datasets for machine learning training, new transformations for predictive processing, new algorithms, Spark notebooks, and more.
These new data entities must be seamlessly incorporated and managed as part of an overall governance pipeline, one that is flexible and configurable enough to treat each of these entities in the most effective individual manner.
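One way to picture “treating each entity in the most effective individual manner” is a simple dispatch by entity type, where each type gets its own governance treatment. The handler names and treatments below are assumptions for illustration, not a description of any product’s actual policies.

```python
# Illustrative governance treatments per entity type; the specific
# steps named here are assumptions, not a real product's policies.
def govern_training_set(name: str) -> str:
    return f"{name}: profiled, versioned, PII-scanned"


def govern_transformation(name: str) -> str:
    return f"{name}: code-reviewed, lineage recorded"


def govern_notebook(name: str) -> str:
    return f"{name}: snapshotted, dependencies pinned"


HANDLERS = {
    "training_set": govern_training_set,
    "transformation": govern_transformation,
    "notebook": govern_notebook,
}


def govern(entity_type: str, name: str) -> str:
    """Route a new entity to the governance treatment for its type."""
    handler = HANDLERS.get(entity_type)
    if handler is None:
        raise ValueError(f"no governance policy for {entity_type}")
    return handler(name)
```

Adding support for a new entity type then means registering one more handler, rather than rebuilding the pipeline, which is the flexibility the paragraph above calls for.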
For many customers, Zaloni Arena has provided the capabilities to support this staged pipeline, tailored to each of the different ingested data types: a persona-based, zone-based value chain (raw to finished goods for each target consumer), a rich transformation workflow layer, and lifecycle management for data based on age and frequency of access.
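Lifecycle management by age and access frequency can be sketched as a small policy function that picks a storage tier for a dataset. The tier names and thresholds below are illustrative assumptions, not actual product settings.

```python
def storage_tier(age_days: int, accesses_last_30d: int,
                 archive_age_days: int = 365,
                 cold_threshold: int = 5) -> str:
    """Pick a storage tier from dataset age and recent access frequency.
    Thresholds here are illustrative defaults, not product settings."""
    # Old and completely unused: move to archival storage.
    if accesses_last_30d == 0 and age_days > archive_age_days:
        return "archive"
    # Rarely accessed: cheaper, slower storage.
    if accesses_last_30d < cold_threshold:
        return "cold"
    # Actively used: keep on fast storage.
    return "hot"
```

A real lifecycle engine would evaluate a rule set like this on a schedule across the catalog; the point of the sketch is simply that tiering decisions can be driven by the same two signals the paragraph names, age and access frequency.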
Arena provides the capabilities to merge and orchestrate data science workflows with data governance workflows.
If you’ve been exploring ways to confidently provide governed, self-service data access to your business, you’ll want to chat with us about how our customers are doing this today.
Contact us for a demo of Zaloni Arena.