Building an enterprise data lake is a complicated process. Data organization and data governance quickly become a challenge when different teams within an organization ingest, refine or transform, and produce massive amounts of data. Left unmanaged, data lakes can become data swamps, making data democratization impossible.
Zaloni’s Arena platform provides embedded zone-based governance and reference architecture for data management known as EndZone Governance™. It does so by providing a framework for zones as a first-class metadata and governance attribute for all datasets stored in a data lake. Data Producers and Data Stewards can organize the data as it flows through different steps such as ingestion into a raw area, executing data quality on ingested data to create trusted data, and transforming or refining data for end-user consumption.
Platform owners can take advantage of zones by appropriately assigning security policies on each zone so that it becomes simpler and predictable to manage granular access and governance policies across the organization.
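The idea of attaching policy to a zone rather than to each dataset can be illustrated with a minimal sketch. This is not Arena's API or data model: the role names, entity names, and the `can_read` helper are assumptions made up for illustration, and the zone names simply mirror the raw, trusted, and refined stages described above.

```python
# Hypothetical zone-level read policies: one rule set per zone, not per dataset.
ZONE_READERS = {
    "raw": {"data_engineer"},
    "trusted": {"data_engineer", "data_steward"},
    "refined": {"data_engineer", "data_steward", "analyst"},
}

# Hypothetical catalog mapping each entity to the zone it was assigned to.
ENTITY_ZONE = {
    "customers_raw": "raw",
    "customers_clean": "trusted",
    "customers_report": "refined",
}


def can_read(role: str, entity: str) -> bool:
    """Return True if the role may read the entity, based only on the entity's zone."""
    zone = ENTITY_ZONE.get(entity)
    if zone is None:
        # Entities without a zone get no access in this sketch.
        return False
    return role in ZONE_READERS.get(zone, set())


print(can_read("analyst", "customers_report"))  # True
print(can_read("analyst", "customers_raw"))     # False
```

Because access is decided by the zone, adding a new dataset to an existing zone requires no new policy in this sketch; the dataset inherits whatever the zone already allows.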
To read more about Arena’s EndZone Governance architecture, download the white paper.
Zones can be assigned as part of entity creation or to existing entities. Arena also provides capabilities to easily assign zones dynamically as part of building a data pipeline. Arena provides a base path for HDFS, S3 Bucket, or ADLS for each zone, taking away the guesswork for end-users trying to decide where to create a new dataset. Using zones as part of the data pipeline process not only improves pipeline efficiency but also ensures governance compliance.
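As a rough sketch of how a per-zone base path removes the guesswork about where a new dataset should live, the snippet below derives a default location from the zone. The `Zone` class, the `default_entity_location` helper, and the bucket and path names are all hypothetical and do not reflect how Arena stores or resolves paths.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Zone:
    """Hypothetical stand-in for a zone record; not Arena's data model."""
    name: str
    base_path: str  # e.g. an HDFS, S3, or ADLS URI prefix


def default_entity_location(zone: Zone, entity_name: str) -> str:
    """Place a new entity directly under the zone's base path."""
    return f"{zone.base_path.rstrip('/')}/{entity_name.strip('/')}"


# Illustrative zones and paths only.
raw = Zone("raw", "s3://example-lake/raw/")
trusted = Zone("trusted", "s3://example-lake/trusted")

print(default_entity_location(raw, "customers"))      # s3://example-lake/raw/customers
print(default_entity_location(trusted, "customers"))  # s3://example-lake/trusted/customers
```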
The rest of this post will provide a walkthrough of the steps of using zone assignment while building a data pipeline for executing data quality on a dataset.
Step 1 – Create a new Zone for your data quality output (Optional – if it doesn’t already exist)
Please Note: This operation is typically done by platform or project administrators who have the permission to create and manage zones.
Navigate to Control -> Zones and add a new Zone
Zones can be created on HDFS, S3, and ADLS data stores. Users can optionally provide a base path for the zone. By switching on “Match Base Path with Entity Location,” any end-user of this zone will be prohibited from creating a new dataset that does not comply with the zone policy.
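The kind of rule that “Match Base Path with Entity Location” implies can be sketched as a simple prefix check: an entity location complies only if it sits under the zone's base path on the same store. This is an assumption about the behavior, written for illustration; the `complies_with_zone` function and the example URIs are hypothetical, not Arena's implementation.

```python
import posixpath
from urllib.parse import urlparse


def complies_with_zone(entity_location: str, zone_base_path: str) -> bool:
    """Return True if entity_location is the zone base path or lies beneath it."""
    entity = urlparse(entity_location)
    base = urlparse(zone_base_path)

    # The scheme (s3, hdfs, abfss, ...) and the bucket or host must match exactly.
    if (entity.scheme, entity.netloc) != (base.scheme, base.netloc):
        return False

    # Normalize so trailing slashes and ".." segments cannot bypass the check.
    entity_path = posixpath.normpath(entity.path or "/")
    base_path = posixpath.normpath(base.path or "/")

    # Compare on a path-segment boundary so /trusted_old does not match /trusted.
    return entity_path == base_path or entity_path.startswith(base_path.rstrip("/") + "/")


print(complies_with_zone("s3://example-lake/trusted/customers", "s3://example-lake/trusted/"))     # True
print(complies_with_zone("s3://example-lake/trusted_old/customers", "s3://example-lake/trusted"))  # False
```

Matching on a segment boundary rather than a raw string prefix is a deliberate choice here, since a plain prefix test would wrongly accept sibling directories whose names merely start with the zone's directory name.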
Step 2 – Create a DQ Workflow
Navigate to Consume -> Manage Workflow to create a new workflow. Users can also use the quick link menu in the masthead to create a new workflow.
Drag the start/end control actions and the data quality action into place to design the workflow.
Step 3 – Configure the DQ Action
In conclusion, assigning zones within the data quality process saves manual steps for data stewards and reduces the time it takes to make data available for analysts. Entities that are created during the data quality process are categorized and located in zones that comply with your governance policies.
If you’d like to see a live custom demo, visit zaloni.com/get-a-demo to schedule one with our team.