November 30th, 2020
Excerpt from the report Managing the Data Lake: Moving to Big Data Analysis, by Andy Oram, editor at O’Reilly Media
Why do you need to know what metadata to preserve about your data? Reasons for doing so abound:
Ben Sharma, CEO and co-founder of Zaloni, talks about creating “a single source of truth” from the diverse data sets you take in. By creating a data catalog, you can store this metadata for use by downstream programs.
There are three types of metadata, according to Zaloni:
The next question is how to create metadata. Many tools can extract the easy stuff, such as file sizes and timestamps, as the stages of processing proceed. Other metadata requires custom-written programs that do such things as tag particular data fields you’ll want to extract later.
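The two kinds of extraction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular tool or the Zaloni platform does it: the function names `extract_basic_metadata` and `tag_fields`, the shape of the returned dictionaries, and the idea of matching tags against column names are all hypothetical.

```python
import os
from datetime import datetime, timezone


def extract_basic_metadata(path):
    """Collect the 'easy' metadata that the file system already records."""
    info = os.stat(path)
    return {
        "path": os.path.abspath(path),
        "size_bytes": info.st_size,
        # Last-modified time, normalized to UTC in ISO 8601 form.
        "modified_utc": datetime.fromtimestamp(
            info.st_mtime, tz=timezone.utc
        ).isoformat(),
    }


def tag_fields(header, tag_rules):
    """Custom step: tag columns whose names contain a trigger substring.

    tag_rules maps a tag name to the substrings that trigger it.
    Both the tags and the rules are hypothetical examples.
    """
    tags = {}
    for column in header:
        matched = sorted(
            tag
            for tag, needles in tag_rules.items()
            if any(needle in column.lower() for needle in needles)
        )
        if matched:
            tags[column] = matched
    return tags
```

For example, calling `tag_fields(["customer_email", "order_total"], {"pii": ["email"], "financial": ["total"]})` marks the first column as `pii` and the second as `financial`, so a later stage can find those fields without re-inspecting the data.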
At any stage of processing, you may choose to update the metadata. Each stage can also consult the metadata when applying rules for user access, cleaning, and submitting data to jobs. We’ll see later how, at least in theory, storing feedback in metadata can create an environment of continuous quality improvement.
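As a rough sketch of a stage that both consults and updates metadata, consider the following Python. It assumes the catalog is a plain dictionary keyed by data set name. The `required_roles` and `history` keys, and the functions `can_read` and `run_stage`, are hypothetical names chosen for illustration rather than part of any real catalog API.

```python
from datetime import datetime, timezone


def can_read(entry, user_roles):
    """Allow access only if the user holds every role the entry requires."""
    return set(entry.get("required_roles", [])) <= set(user_roles)


def run_stage(catalog, dataset, stage_name, user_roles, notes=None):
    """Consult the catalog entry before the stage, then record what happened."""
    entry = catalog[dataset]
    if not can_read(entry, user_roles):
        raise PermissionError(f"{stage_name}: access to {dataset!r} denied")
    # ... the stage's real work (cleaning, job submission) would go here ...
    entry.setdefault("history", []).append({
        "stage": stage_name,
        "at_utc": datetime.now(timezone.utc).isoformat(),
        "notes": notes or {},  # feedback that later stages can read
    })
    return entry
```

Appending to a history list rather than overwriting fields keeps a record of each stage, which is one simple way to leave feedback for later stages.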
Currently, one of the biggest challenges in data management is communicating metadata to downstream parts of a workflow. Many of the Zaloni Data Platform’s benefits rest on its ability to do this conveniently.