Data lakes fundamentally changed how our customers viewed data management and data processing for their big data use-cases. As technology stacks have evolved, there are many options available beyond HDFS/Hive for customers to consider when building their data platforms.
This has caused a shift in the data lake paradigm, where organizations are defining data lakes in a more logical way. Data can reside in different data stores, such as distributed object stores (S3, Azure Blob Storage, etc.) and cloud and on-prem data warehouses (Snowflake, Redshift, SQL Server, etc.).
Data consumers (Business/Data Analysts, Data Engineers, and Scientists) often struggle to operate efficiently in these environments due to several factors:
- Lack of unified metadata and a data catalog to discover and understand the data.
- Extracting data is time-consuming and often requires help from IT/platform teams.
- Lack of automation and process for data consumers to contribute back to the unified data catalog.
- Lack of defined governance processes to create sandbox environments for analytics and reporting use-cases.
Arena, the DataOps platform by Zaloni, has taken a unique approach to solve this problem by enabling governed, self-service access to data, regardless of where it is stored.
We took inspiration from existing e-commerce marketplaces, like Amazon, to build an enterprise-wide data marketplace where data consumers can search and discover datasets and provision data to a sandbox environment for further analytics and reporting. Under the hood, Arena performs the “fulfillment” of data to the sandbox environment.
The rest of this post walks through the steps for provisioning data using Arena’s built-in data marketplace.
Step 1: Discovering Data
Use the Global Search feature to discover and understand datasets in an enterprise-wide data catalog. You can search on any metadata at the dataset or column level, including business attributes, technical metadata, and associated metadata such as data quality rules.
Users can further refine their results through filtering by project, zone, datastore, etc.
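To make the search-then-filter flow concrete, here is a minimal Python sketch of a metadata search over a small in-memory catalog. It is not Arena's API: the `Dataset` and `Column` classes, the `search` function, and the sample catalog entries are all hypothetical, and a real catalog would sit on a search index rather than a list.

```python
from dataclasses import dataclass, field


@dataclass
class Column:
    name: str                 # technical metadata
    business_name: str = ""   # business attribute


@dataclass
class Dataset:
    name: str
    project: str
    zone: str
    datastore: str
    columns: list = field(default_factory=list)
    dq_rules: list = field(default_factory=list)  # associated metadata


def matches(dataset, term):
    """True if the term appears in any dataset-level or column-level metadata."""
    term = term.lower()
    haystack = [dataset.name] + list(dataset.dq_rules)
    for column in dataset.columns:
        haystack += [column.name, column.business_name]
    return any(term in text.lower() for text in haystack)


def search(catalog, term, **filters):
    """Search by term, then refine by exact-match filters such as zone or datastore."""
    hits = [ds for ds in catalog if matches(ds, term)]
    for attribute, value in filters.items():
        hits = [ds for ds in hits if getattr(ds, attribute) == value]
    return hits


# Hypothetical catalog entries, for illustration only.
CATALOG = [
    Dataset("customer_orders", "sales", "raw", "s3",
            [Column("cust_id", "Customer ID")], ["cust_id not null"]),
    Dataset("customer_master", "sales", "trusted", "snowflake",
            [Column("customer_id", "Customer ID")]),
    Dataset("web_clicks", "marketing", "raw", "s3",
            [Column("session_id", "Session")]),
]

if __name__ == "__main__":
    for ds in search(CATALOG, "customer", datastore="s3"):
        print(ds.name)
```

The filters are applied after the text match, which mirrors how a user first searches and then narrows the result list.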
Step 2: Add Data to Shopping Cart
Once users have discovered the dataset(s) needed, these datasets can be added to a shopping cart from the Dataset Details page or a quick action in Global Search. The shopping cart can contain mixed types of datasets, e.g., datasets stored in a data warehouse and an S3 object store.
For a relational database dataset, users can provide a JDBC connection that Arena utilizes for reading the data in the provisioning process. Using the catalog metadata, Arena intelligently narrows down the list of available connections to simplify choosing the right connection.
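The post does not describe how Arena narrows the connection list, so the following Python sketch shows only one plausible heuristic: keep the JDBC connections whose URL vendor matches the dataset's database type, and prefer those that also point at the same host. The `JdbcConnection` class, the function names, and the example URLs are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JdbcConnection:
    name: str
    url: str


def jdbc_vendor(url):
    """Extract the vendor token from a URL of the form jdbc:<vendor>:<rest>."""
    parts = url.split(":", 2)
    if len(parts) < 3 or parts[0] != "jdbc":
        raise ValueError(f"not a JDBC URL: {url}")
    return parts[1]


def candidate_connections(dataset_vendor, dataset_host, connections):
    """Return connections for the same vendor, preferring an exact host match."""
    same_vendor = [c for c in connections if jdbc_vendor(c.url) == dataset_vendor]
    same_host = [c for c in same_vendor if dataset_host in c.url]
    # Fall back to every same-vendor connection when no host matches.
    return same_host or same_vendor


# Hypothetical connections, for illustration only.
CONNECTIONS = [
    JdbcConnection("pg-prod-ro", "jdbc:postgresql://pg-prod.example.com:5432/sales"),
    JdbcConnection("pg-dev", "jdbc:postgresql://pg-dev.example.com:5432/sales"),
    JdbcConnection("mssql-dw", "jdbc:sqlserver://dw.example.com:1433;databaseName=dw"),
]

if __name__ == "__main__":
    for conn in candidate_connections("postgresql", "pg-prod.example.com", CONNECTIONS):
        print(conn.name)
```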
Users have several options to further filter the datasets they’d like to provision. Available options are:
- Row Level
  - By default, all rows
  - Sample set by row count or percentage of rows
  - Where clause expression builder
- Column Level
  - By default, all columns
  - Selected columns
- For advanced use-cases, Arena provides a Spark SQL editor where users can write a query that filters the required data.
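As a rough illustration of how the row-level and column-level options could translate into a single Spark SQL statement, here is a small Python sketch. The `build_provision_query` function and its parameters are hypothetical and do not reflect how Arena generates queries. It relies on Spark SQL's `TABLESAMPLE` clause for sampling, and the where clause is passed through as written, so it assumes a trusted expression.

```python
def build_provision_query(table, columns=None, where=None,
                          sample_rows=None, sample_percent=None):
    """Build a Spark SQL SELECT from the chosen filter options.

    With no options set, the query returns all rows and all columns.
    """
    if sample_rows is not None and sample_percent is not None:
        raise ValueError("choose either sample_rows or sample_percent, not both")

    # Column level: selected columns (backtick-quoted) or everything.
    select_list = ", ".join(f"`{c}`" for c in columns) if columns else "*"
    sql = f"SELECT {select_list} FROM {table}"

    # Row level: sample by percentage or by row count.
    if sample_percent is not None:
        sql += f" TABLESAMPLE ({float(sample_percent):g} PERCENT)"
    elif sample_rows is not None:
        sql += f" TABLESAMPLE ({int(sample_rows)} ROWS)"

    # Row level: where clause, applied to the (possibly sampled) rows.
    if where:
        sql += f" WHERE {where}"
    return sql


if __name__ == "__main__":
    print(build_provision_query("sales.orders",
                                columns=["id", "amount"],
                                where="amount > 100",
                                sample_percent=10))
```

Note that in this sketch the sample is taken before the where clause is applied, so a filtered 10 percent sample may return far fewer than 10 percent of the matching rows.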
Step 3: View Cart
To view datasets in your cart, simply click on the cart icon or navigate to Consume -> Shopping Cart.
You can select all or specific items before initiating the checkout process. Click Provision to begin the checkout process.
Step 4: Cart Checkout
Users can provide a description and additional information for any custom metadata that has been configured on the Arena instance. For example, in the image below, “Intended Use,” “VM Details,” “Tools Used,” and “Data Lease Duration” are custom fields that help define the sandbox environment and data required for Data Owners to approve this request. You can read more about the approval process here.
Users can choose between several options available for their destination: Hive/HDFS, S3, ADLS, Relational Databases, SFTP/FTP, etc.
For each destination type, users can choose the appropriate connection that Arena will use to write the data.
Users can also choose to contribute back to the data catalog by simply adding the output of provisioning to the catalog so that the data is now discoverable by other data consumers.
The last step is to review the provision checkout summary, with options to enable notifications for success or failure of the provisioning process.
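The checkout choices described above (datasets, destination, custom metadata, catalog contribution, notifications) can be pictured as one request object that is checked before submission. The Python sketch below is purely illustrative: `ProvisionRequest`, `validate`, the destination keys, and the example values are hypothetical and are not Arena's data model.

```python
from dataclasses import dataclass, field

# Destination types named in the post; the short keys are made up for this sketch.
DESTINATION_TYPES = {"hive", "s3", "adls", "rdbms", "sftp"}


@dataclass
class ProvisionRequest:
    datasets: list
    destination_type: str
    connection: str
    description: str = ""
    custom_metadata: dict = field(default_factory=dict)
    add_to_catalog: bool = False   # contribute the output back to the catalog
    notify_on: tuple = ()          # e.g. ("success", "failure")


def validate(request, required_fields=()):
    """Return a list of problems; an empty list means the request can be submitted."""
    problems = []
    if not request.datasets:
        problems.append("no datasets selected")
    if request.destination_type not in DESTINATION_TYPES:
        problems.append(f"unknown destination type: {request.destination_type}")
    if not request.connection:
        problems.append("no destination connection chosen")
    missing = [name for name in required_fields
               if not request.custom_metadata.get(name)]
    if missing:
        problems.append("missing custom metadata: " + ", ".join(missing))
    return problems


if __name__ == "__main__":
    request = ProvisionRequest(
        datasets=["customer_orders"],
        destination_type="s3",
        connection="sandbox-bucket",
        custom_metadata={"Intended Use": "churn analysis",
                         "Data Lease Duration": "30 days"},
        add_to_catalog=True,
        notify_on=("success", "failure"),
    )
    print(validate(request, required_fields=("Intended Use", "Data Lease Duration")))
```

Collecting every problem in one pass, rather than failing on the first, matches the idea of a checkout summary that the user reviews once before submitting.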
Finally, users can monitor and view details of the provisioning by navigating to Monitor -> Provisions.
In conclusion, Arena provides a unified data platform that augments the data catalog so data consumers can quickly and easily consume data in a governed fashion. This removes the wait on traditional enterprise processes, reduces reliance on IT teams, and accelerates time to analytics.
Contact us today if you’d like to learn more or get a custom demo!