Building an enterprise data lake is a complicated process. Data organization and data governance quickly become a challenge when ingesting, refining/transforming, and producing massive amounts of data across different teams within an organization. If left unmanaged, a data lake can become a data swamp, making data democratization impossible.
Zaloni’s Arena platform provides embedded zone-based governance and reference architecture for data management known as EndZone Governance™. It does so by providing a framework for zones as a first-class metadata and governance attribute for all datasets stored in a data lake. Data Producers and Data Stewards can organize the data as it flows through different steps such as ingestion into a raw area, executing data quality on ingested data to create trusted data, and transforming or refining data for end-user consumption.
Platform owners can take advantage of zones by appropriately assigning security policies on each zone so that it becomes simpler and predictable to manage granular access and governance policies across the organization.
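Conceptually, per-zone security can be pictured as a mapping from each zone to the roles allowed to access it. The sketch below is purely illustrative — the zone names, role names, and `can_read` helper are assumptions for this post, not Arena's actual policy model.

```python
# Hypothetical sketch: per-zone access policies as a simple
# zone -> allowed-roles mapping. Zone and role names are illustrative.
ZONE_POLICIES = {
    "raw":     {"data-engineer"},
    "trusted": {"data-engineer", "data-steward"},
    "refined": {"data-engineer", "data-steward", "analyst"},
}

def can_read(zone: str, role: str) -> bool:
    """Grant access only if the role appears in the zone's policy."""
    return role in ZONE_POLICIES.get(zone, set())

print(can_read("refined", "analyst"))  # True: analysts consume refined data
print(can_read("raw", "analyst"))      # False: raw zone is locked down
```

Because the policy is attached to the zone rather than to individual datasets, every dataset landing in a zone inherits the same access rules, which is what makes granular governance predictable at scale.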
To read more about Arena’s EndZone Governance architecture, download the white paper.
Zones can be assigned as part of entity creation or to existing entities. Arena also provides capabilities to easily assign zones dynamically as part of building a data pipeline. Arena provides a base path for HDFS, S3 Bucket, or ADLS for each zone, taking away the guesswork for end-users trying to decide where to create a new dataset. Using zones as part of the data pipeline process not only improves pipeline efficiency but also ensures governance compliance.
The rest of this post will provide a walkthrough of the steps of using zone assignment while building a data pipeline for executing data quality on a dataset.
Step 1 – Create a new Zone for your data quality output (Optional – if it doesn’t already exist)
Please Note: This operation is typically done by platform or project administrators who have the permission to create and manage zones.
Navigate to Control -> Zones and add a new Zone
Zones can be created on HDFS, S3, and ADLS data stores. Users can optionally provide a base path for the zone. By switching on “Match Base Path with Entity Location,” any end-user of this zone will be prohibited from creating a new dataset that does not comply with the zone policy.
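The “Match Base Path with Entity Location” policy can be thought of as a prefix check on any proposed dataset location. The following sketch is an assumption-laden illustration, not Arena's implementation — the `Zone` class, field names, and validation logic are all hypothetical.

```python
from dataclasses import dataclass

# Illustrative sketch only: class and field names are assumptions,
# not Arena's actual data model.
@dataclass
class Zone:
    name: str
    data_store: str        # e.g. "HDFS", "S3", or "ADLS"
    base_path: str         # optional base path for the zone
    match_base_path: bool  # the "Match Base Path with Entity Location" switch

def validate_entity_location(zone: Zone, entity_path: str) -> bool:
    """Reject locations outside the zone's base path when the policy is on."""
    if not zone.match_base_path:
        return True
    return entity_path.startswith(zone.base_path)

trusted = Zone("trusted", "S3", "s3://lake/trusted/", match_base_path=True)
print(validate_entity_location(trusted, "s3://lake/trusted/sales/orders"))  # True
print(validate_entity_location(trusted, "s3://lake/raw/sales/orders"))      # False
```

With the switch off, the check always passes and users may choose any location; with it on, only paths under the zone's base path are accepted.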
Step 2 – Create a DQ Workflow
Navigate to Consume -> Manage Workflow to create a new workflow. Users can also use the quick link menu in the masthead to create a new workflow.
Drag the start/end control actions and data quality action to design the workflow as shown below.
Step 3 – Configure the DQ Action
- Click on the data quality action to open the slide-out to configure this action.
- Start by selecting the raw entity in the input properties using the auto-suggest feature.
- Arena intelligently populates the output properties – such as entity names, schema, and output path for the GOOD, BAD, and Report data. The user can further modify these.
- Choose the zone for your GOOD, BAD and Report datasets. Notice that Arena automatically populates the output path based on the entity name, schema, and base path of the zone. When the zone policy for match base path is turned on, users will not be allowed to input any path outside the base path defined in the zone. The zone policy ensures that the data is always created in the right directory structure and complies with governance policies set up by the platform.
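The auto-populated output paths described above can be pictured as joining the zone's base path with the entity's schema and name. The helper below is a hypothetical illustration of that derivation — the function name, path layout, and dataset labels are assumptions, not Arena's code.

```python
from posixpath import join

# Hypothetical illustration: derive output paths for the GOOD, BAD,
# and Report datasets from a zone base path plus entity schema and name.
def output_paths(base_path: str, schema: str, entity: str) -> dict:
    return {
        label: join(base_path, schema, entity, label.lower())
        for label in ("GOOD", "BAD", "REPORT")
    }

paths = output_paths("s3://lake/trusted", "sales", "orders")
print(paths["GOOD"])    # s3://lake/trusted/sales/orders/good
print(paths["REPORT"])  # s3://lake/trusted/sales/orders/report
```

Deriving every path from the zone's base path is what lets the platform guarantee that data quality outputs always land inside the governed directory structure, with no per-dataset guesswork.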
- Save the workflow and execute it on demand, on a schedule, or as a post-ingestion workflow associated with an ingestion event.
In conclusion, assigning zones within the data quality process saves manual steps for data stewards and reduces the time it takes to make data available for analysts. Entities that are created during the data quality process are categorized and located in zones that comply with your governance policies.
If you’d like to see a live custom demo, visit www.zaloni.com/get-a-demo to schedule one with our team.