Why you might want to use Delta Lake as your go-to big data storage format, and how Arena can help you get there.

Introduction to Delta Lake

Like other aspects of technology, storage formats continue to evolve and bring new benefits to an organization’s ever-growing technology stack. One format that has recently gotten a lot of buzz in the big data world, and for good reason, is Delta Lake. Delta Lake, or simply Delta, is an open-source storage layer based on Parquet. It brings with it all of Parquet’s advantages, such as columnar storage, improved performance on filtering, and support for nested structures. Delta Lake differs from Parquet, however, in that it solves some of the common challenges found when dealing with data outside of a traditional RDBMS.

Traditionally, the downfall of many big data projects has been a lack of data consistency and quality. Left unresolved, this problem has often turned the data lake into what is known as a data swamp. Unverified, mismatched, out-of-date, and unversioned layers of data filling an S3 bucket somewhere have led to distrust in the data and chaos for users trying to find and access the correct information. Delta Lake brings full ACID transactions to the open-source big data world. Users now have the power to ensure that all data in their data lake is consistent and versioned.

In this blog, we’ll review what Delta Lake is and why it’s making such a splash in the data lake world (pun intended). We’ll then discuss why the benefits of Delta Lake pair so well with Arena.

Delta What?

Created by Dr. Michael Armbrust (father of Spark SQL) and his team at Databricks, Delta Lake at its simplest level is Apache Parquet that logs all of its own metadata. This is a drastic oversimplification, but it explains why Delta Lake shares much of the familiarity and compatibility already found in Parquet. It also undersells Delta Lake, because the metadata it keeps is what provides its most powerful capabilities.

These transaction logs make Delta Lake fully ACID compliant, eliminating some of the biggest issues currently found in most data lakes. With Delta Lake, data stored in a data lake can behave more transactionally, as you might expect from an RDBMS, while still being stored in a distributed, big-data fashion. The transaction logs also allow for schema enforcement (and evolution as necessary), time travel, and an audit history of the data.

Perhaps the best part of Delta Lake is that it was designed to be fully integrated with Apache Spark. This integration makes working with Delta Lake extremely simple: in Spark code that reads or writes Parquet, users can typically replace the format name “parquet” with “delta”. Finally, as you might expect of a big data storage format built by the creators of Spark, Delta Lake handles its own metadata at scale, so that petabyte-sized tables can still be handled easily using Spark’s distributed processing.

Figure 1 – Using Delta Lake with Spark [1]
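
To make the idea of a table that logs its own metadata more concrete, here is a minimal sketch in plain Python. It is a toy model, not Delta Lake’s actual implementation or file layout: the class name ToyDeltaTable, its methods, and the "append" and "overwrite" operation names are all assumptions invented for illustration. It shows how an append-only commit log can provide schema enforcement, all-or-nothing commits, time travel, and an audit history.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDeltaTable:
    """Toy in-memory table whose state is derived from an append-only commit log."""

    schema: dict                               # column name -> expected Python type
    log: list = field(default_factory=list)    # one entry per committed version

    def commit(self, operation, rows):
        """Validate a batch, then record it as the next version. Returns the version."""
        if operation not in ("append", "overwrite"):
            raise ValueError(f"unknown operation {operation!r}")
        # Schema enforcement: check every row before anything is recorded,
        # so a bad batch leaves the log untouched (all-or-nothing).
        for row in rows:
            if set(row) != set(self.schema):
                raise ValueError(f"columns {sorted(row)} do not match the schema")
            for column, expected in self.schema.items():
                if not isinstance(row[column], expected):
                    raise ValueError(f"column {column!r} expects {expected.__name__}")
        version = len(self.log)
        self.log.append({"version": version, "operation": operation, "rows": list(rows)})
        return version

    def read(self, version=None):
        """Time travel: rebuild the table by replaying the log up to `version`."""
        if not self.log:
            return []
        latest = len(self.log) - 1
        version = latest if version is None else version
        if not 0 <= version <= latest:
            raise ValueError(f"version {version} does not exist")
        rows = []
        for entry in self.log[: version + 1]:
            if entry["operation"] == "overwrite":
                rows = []                      # an overwrite discards earlier rows
            rows.extend(entry["rows"])
        return rows

    def history(self):
        """Audit history: (version, operation, number of rows written)."""
        return [(e["version"], e["operation"], len(e["rows"])) for e in self.log]


table = ToyDeltaTable(schema={"id": int, "name": str})
table.commit("append", [{"id": 1, "name": "a"}])
table.commit("append", [{"id": 2, "name": "b"}])
table.commit("overwrite", [{"id": 3, "name": "c"}])

print(table.read())            # current state: only the overwritten row
print(table.read(version=1))   # state as of version 1: the two appended rows
print(table.history())
```

Because earlier log entries are never modified, reading an older version only requires replaying a shorter prefix of the log. The real format works with files on distributed storage rather than Python lists, so treat this purely as a mental model.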

With all of these features, there are plenty of reasons why an organization might consider using Delta Lake. The creators have made it extremely easy to shift your data from Parquet, or other formats, over to Delta Lake using Spark’s native abilities. Couple this simple transition with a stable transaction history that improves data integrity and the ability to roll back any unwanted data changes, and your data swamp can soon be a data lake that would make Fiji water look dirty.
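
As a rough illustration of the transition described above, the following PySpark sketch copies a Parquet dataset into a Delta table and then reads an earlier version of that table. It assumes a SparkSession that has already been configured with the Delta Lake package. The function names and the paths in the usage comment are hypothetical; the reader and writer calls (format, load, mode, save, and the versionAsOf option) are the standard Spark and Delta Lake ones.

```python
def parquet_to_delta(spark, source_path, target_path):
    """Copy a Parquet dataset into a Delta table and return the DataFrame that was read.

    `spark` is assumed to be a SparkSession with Delta Lake available.
    """
    df = spark.read.format("parquet").load(source_path)
    # Same writer API as for Parquet; only the format name changes.
    df.write.format("delta").mode("overwrite").save(target_path)
    return df


def read_delta_version(spark, table_path, version):
    """Time travel: load the Delta table as it was at an earlier version number."""
    return (
        spark.read.format("delta")
        .option("versionAsOf", version)
        .load(table_path)
    )


# Example usage (paths are hypothetical and need a live Spark session):
# parquet_to_delta(spark, "s3://my-bucket/events_parquet", "s3://my-bucket/events_delta")
# original = read_delta_version(spark, "s3://my-bucket/events_delta", 0)
```

Reading an older version in this way is also a simple starting point for undoing an unwanted change, since the earlier data can be inspected or written back.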

Arena + Delta Lake = A Match Made in Data Heaven

With Zaloni’s Arena platform and Delta Lake, users can now unlock the benefits that the data lake has promised, but failed to deliver, for so long. Arena allows the user to search a catalog of data, regardless of source, quickly access the data they need, and provision it to a sandbox environment in a controlled and governed manner. With Delta Lake in the mix, users can rest assured that the data in the lake follows appropriate schemas and has good integrity, especially when dealing with datasets coming from high-velocity systems with high transaction rates. For those worried about the transition to Delta Lake, the move is straightforward: as noted above, Delta Lake is fully compatible with Apache Spark, and Arena allows users to add their own Spark operations directly from the UI. This means reading from and writing to Delta Lake is extremely simple and can be easily integrated into new or existing workflows.

Arena also complements Delta Lake nicely with its built-in zone-based governance. As noted in Delta Lake’s own documentation, zone-based governance is quickly becoming an expectation in data lakes, and one that Delta Lake aims to support. The transactional integrity of Delta Lake pairs nicely with the concept but does not address how zones often feel ethereal to users. Arena’s deep incorporation of zones makes them tangible: Arena treats zones as a top-level item, so users can be granted access by zone or filter their data by zone.

 

Figure 2 – Example zone governance found in Arena

Conclusion

With Arena and Delta Lake, users can finally realize the many promises of the data lake that have been so elusive. Data scientists and business analysts can quickly search and access quality data without spending 50-80% of their time finding, cleaning, and prepping their data. Data stewards and engineers can create workflows that move and transform their data quickly while ensuring the data stays consistent and valid. Zone-based governance can be implemented in a consistent and tangible manner. These are just a few of the many reasons Zaloni customers love the marriage of Arena and Delta Lake. If you’re interested in learning more about how Arena and Delta Lake can help solve your DataOps problems, fill out our contact form and request a personalized demo today.

 

Sources:

[1]  https://www.slideshare.net/databricks/making-apache-spark-better-with-delta-lake
