March 2nd, 2017
Common sense tells us that you can’t use data unless you understand its quality. Data quality checks are critical for the data lake, but it’s not unusual for companies to gloss over this step in the rush to move data into less costly, scalable Hadoop storage, especially during initial adoption. After all, isn’t landing data in Hadoop with little definition of schema and data quality what Hadoop is all about? Once data has landed in a raw zone in Hadoop, reality quickly sets in: for the data to be useful, both structure and data quality must be applied. Defining data quality rules becomes particularly important depending on what sort of data you’re bringing into the data lake; for example, large volumes of data from machines and sensors. Validation is essential for this kind of data because it comes from an external environment and probably hasn’t gone through any quality checks.
If you already use Hadoop and have data in a data lake that hasn’t gone through a data quality process as standard operating procedure, don’t worry. There are a number of best practices for validating data, whether you’re still planning a Hadoop implementation or already have a data lake. And no matter what stage of data maturity you are in, you can run your data quality checks on Hadoop itself, taking advantage of its natural parallelism and its financial benefits.
First, what do we mean by “data quality”? Data quality in Hadoop is not the same as data quality in a traditional data warehouse, where partial records are often rejected. One of the benefits of Hadoop is that you can keep all of your raw data in its native format and use or transform only the parts of data sets that pass a quality threshold for a particular use case. For example, a data set may not have complete address information but still be useful because it contains the zip codes needed for an analysis. A useful way to think about it is that data quality in Hadoop isn’t always about cleansing data to fit a particular schema; instead, it’s about evaluating the data to know what you have and then determining later whether it is useful for a particular use case. This becomes especially obvious with unstructured or semi-structured data, such as binary data, where data quality can take on a variety of meanings.
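To make the idea of a per-use-case quality threshold concrete, here is a minimal Python sketch. It is an illustration only: the function names (`completeness`, `usable_for`), the field names and the 95% threshold are assumptions, not part of any Hadoop or Zaloni API. The raw records are left untouched; the code only measures what is there and decides whether that is good enough for a given analysis.

```python
def completeness(records, field):
    """Return the fraction of records with a non-empty value for `field`."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)


def usable_for(records, required_fields, threshold):
    """True if every field the use case needs meets the completeness threshold."""
    return all(completeness(records, f) >= threshold for f in required_fields)


# Hypothetical raw records: street is often missing, zip code is always present.
records = [
    {"street": None, "zip": "27709"},
    {"street": "1 Main St", "zip": "10001"},
    {"street": "", "zip": "94105"},
]

# An analysis that only needs zip codes can use this data set as is ...
zip_only_ok = usable_for(records, ["zip"], threshold=0.95)
# ... while one that needs full addresses cannot.
full_address_ok = usable_for(records, ["street", "zip"], threshold=0.95)
```

The same data set passes for one use case and fails for another, which is the difference from a warehouse-style load that would simply reject the partial records.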
To evaluate data quality at big data scale and reduce errors, automation is the key to success. Using a data management platform to validate data automatically during ingest moves data from its raw form into a more consumable format, both for production use cases and for discovery activities by data scientists. Automation matters not just for storing data at scale but for making the data useful to the business as fast as possible, using Hadoop’s natural ability to do work in parallel to enable the right time to value.
Running data quality actions in Hadoop as part of an ETL/ingestion process also lets you move this work out of the traditional data warehouse onto a less expensive, more scalable platform. Traditionally, that has meant running Hadoop in house. Increasingly, we see cloud services and data lake management platforms, like Zaloni’s, used to orchestrate data preparation activities across physical, virtual and hybrid cloud environments. This includes transient clusters such as Amazon EMR, with data kept in S3 as persistent storage.
Zaloni pairs the use of a data lake management platform with a recommended strategy for zones in the data lake: specifically, landing, raw, trusted, refined and sandbox zones, with rules related to data quality, security and privacy (e.g., masking and tokenization) applied as part of the automated movement of data between the zones. The zone your data is in indicates the degree of confidence, level of access, or appropriate use for that data.
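As a rough illustration of rule-gated movement between zones, the Python sketch below promotes a data set one zone at a time only when the checks required by the next zone have passed. The zone names come from the text above; everything else (the linear ordering, the `GATES` mapping, the check names and the `promote` function) is a hypothetical simplification and not how any particular platform implements it. The sandbox zone is left out because it is assumed here to sit outside the promotion path.

```python
# Assumed linear promotion path; sandbox is treated as a side area.
PIPELINE = ["landing", "raw", "trusted", "refined"]

# Hypothetical gates: checks that must have passed before data may enter a zone.
GATES = {
    "trusted": {"quality", "privacy"},
    "refined": {"quality", "privacy", "security"},
}


def promote(current_zone, passed_checks):
    """Return the zone the data set should be in after one promotion attempt."""
    position = PIPELINE.index(current_zone)
    if position == len(PIPELINE) - 1:
        return current_zone  # already in the last zone
    target = PIPELINE[position + 1]
    required = GATES.get(target, set())
    # Move forward only if every required check is among the passed ones.
    return target if required <= set(passed_checks) else current_zone
```

With these assumed gates, `promote("raw", {"quality"})` leaves the data in the raw zone, because the privacy check (for example, masking) has not passed yet, while `promote("raw", {"quality", "privacy"})` moves it to trusted.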
Data quality processes are based on setting functions, rules and rule sets, which standardize the validation of data across data sets. Here’s a simplified overview: functions are the most basic building block (e.g., a number is greater than another number), and can be combined to create rules (e.g., data can’t be null and must be greater than 10). Rules can then be combined to create rule sets (e.g., check all fields and make sure there’s a valid email address). You then determine which validation processes and hierarchy of rules apply to which data or data sets. For example, a simple function (e.g., is this number greater than zero?) may be adequate for some data, while other data may need to be validated by a more complex hierarchy of rules. Often the level of required validation is influenced by legacy restrictions or internal processes that are already in place, so it’s a good idea to evaluate your company’s existing processes before setting your rules.
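The function, rule and rule set hierarchy described above can be sketched in a few lines of Python. This is a minimal illustration, not a real product API: all names (`not_null`, `greater_than`, `looks_like_email`, `all_of`, `validate`) are hypothetical, and the email pattern is a deliberately loose assumption rather than a full address validator.

```python
import re


# Functions: the most basic checks, one value in, True or False out.
def not_null(value):
    return value is not None


def greater_than(threshold):
    def check(value):
        return isinstance(value, (int, float)) and value > threshold
    return check


def looks_like_email(value):
    # Loose illustrative pattern: something@something.something
    return isinstance(value, str) and re.fullmatch(
        r"[^@\s]+@[^@\s]+\.[^@\s]+", value) is not None


# Rules: functions combined, all of which must hold for a value.
def all_of(*checks):
    def rule(value):
        return all(check(value) for check in checks)
    return rule


# Rule set: one rule per field; returns the names of the fields that failed.
def validate(record, rule_set):
    return [field for field, rule in rule_set.items()
            if not rule(record.get(field))]


rule_set = {
    "quantity": all_of(not_null, greater_than(10)),  # not null and > 10
    "email": all_of(not_null, looks_like_email),
}
```

Calling `validate(record, rule_set)` on each record returns an empty list when the record passes and the list of failing field names otherwise, so the same rule set can be reused across data sets that share those fields.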
The most important tip? Automate and standardize your data quality check process as soon as possible. And there’s no need to DIY this technology. You can get up and running relatively quickly with a solution like Zaloni’s Arena DataOps platform that operates natively in Hadoop and provides an all-in-one platform for ingestion and metadata management, including data quality processes like reconciling updated or changed data, and applying file- and record-level watermarking to indicate data lineage. Interested in finding out more? Contact us and we can discuss your needs.