March 2nd, 2017
At times, the search for a perfect data catalog can seem like finding the hay in a needle stack (not only difficult, but painful!). Each stakeholder has equally demanding and disparate sets of requirements for success.
Where business analysts want a slick, refined, and easily navigated UI with simple export capabilities, data scientists might refuse to accept anything that does not allow custom-tailored queries, connections to their favorite notebook, and unburdened access to all of the data that has ever existed in the data lake. Meanwhile, the security group wants none of this! Exposing the data at all is a non-starter.
This leaves you, the tech visionary who has a stable of cutting-edge vendors at the ready and a 5-year rollout plan to go with them, stuck in neutral.
Before you resort to breaking out the floppy disks in protest, here are 5 guidelines for building a successful data catalog that can help your business succeed without forcing your stakeholders to compromise.
Open access is the foundation of any successful data catalog. Demand for a data catalog most commonly emerges from the desire for a more intuitive, less burdensome way to access available data. Any solution that does not address this fundamental point is bound to run into difficulties gaining business support.
For users, open access provides the value that is sorely lacking from traditional data lakes: efficient, accurate and personalized access to data, no matter where that data may originate.
For administrators, open access can dramatically reduce the overhead that comes from routine requests and maintenance of audit histories and ticketing systems. A good catalog should automate and manage these functions.
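As a sketch of what that automation might look like, the snippet below logs a structured audit record every time a dataset is requested, replacing manual tickets and spreadsheets. The catalog function, dataset path, and user names here are hypothetical, not drawn from any particular product.

```python
# A minimal sketch of automated audit logging; the catalog function,
# dataset, and user names are hypothetical placeholders.
import functools
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
audit_log = logging.getLogger("catalog.audit")

def audited(fn):
    """Record who requested which dataset, and when, on every call."""
    @functools.wraps(fn)
    def wrapper(user, dataset, *args, **kwargs):
        audit_log.info(json.dumps({
            "event": fn.__name__,
            "user": user,
            "dataset": dataset,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }))
        return fn(user, dataset, *args, **kwargs)
    return wrapper

@audited
def get_dataset(user, dataset):
    # In a real catalog this would resolve the dataset's location and
    # hand back a reference the user's own tools can open.
    return f"s3://data-lake/{dataset}"

get_dataset("analyst_jane", "sales/2017/q1")
```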
Value: Open access to any data owned or used by consumers of the catalog creates fast time-to-value for users and greater efficiency for administrators.
Security in the data catalog often creates a significant conundrum. While providing open, transparent access to data is paramount to success, that access can also create security threats that will quickly shut the project down. Although security groups at your organization may not have an active role in using the data catalog, they certainly have a vested interest in keeping the organization protected from both external and internal threats. How can you balance these two seemingly opposing forces?
Whether the data environment is on-premises or in the cloud, there are a variety of options for securing data. Cloud vendors, such as Amazon Web Services, provide security capabilities such as virtual private clouds, at-rest and in-flight encryption, and private or dedicated connection options, all of which can be layered with additional data access security.
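As one concrete illustration, the sketch below uses boto3 (the AWS SDK for Python) to apply at-rest encryption when landing an object in S3; the bucket and key names are placeholders, and in-flight encryption comes from the HTTPS endpoint itself.

```python
# A hedged sketch of server-side (at-rest) encryption on upload;
# bucket and key names are placeholders for your environment.
import boto3

s3 = boto3.client("s3")  # credentials resolved from the environment

s3.put_object(
    Bucket="example-data-lake",            # placeholder bucket name
    Key="raw/events/2017/03/02/part-0000.json",
    Body=b'{"event": "catalog_demo"}',
    ServerSideEncryption="AES256",         # SSE at-rest encryption
)
```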
In the Hadoop ecosystem, projects such as Ranger, Knox and Sentry provide a new level of protection for at-rest data on the cluster, while tools such as NiFi and Kafka either support external security protocols (e.g., Kerberos) or ship with built-in security layers. Assuming we focus on Hadoop-based data catalogs, these tools will take care of most facets of security within the data lake. Any remaining holes will come from external systems, and the burden of closing them should fall on those systems.
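For example, pointing a client at a Kerberized Kafka cluster is largely a matter of configuration. The sketch below uses the confluent-kafka Python client; the broker addresses, principal, keytab path, and topic name are all placeholders for your environment.

```python
# A minimal sketch of a Kerberos-secured Kafka producer; broker
# addresses, principal, keytab, and topic are placeholders.
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "broker1:9093,broker2:9093",
    "security.protocol": "SASL_SSL",            # encrypt in flight
    "sasl.mechanisms": "GSSAPI",                # Kerberos authentication
    "sasl.kerberos.service.name": "kafka",
    "sasl.kerberos.principal": "catalog@EXAMPLE.COM",
    "sasl.kerberos.keytab": "/etc/security/keytabs/catalog.keytab",
})

producer.produce("catalog-events", value=b"dataset registered")
producer.flush()
```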
Value: Security, whether applied by the catalog or by underlying systems, is integral to continued operation at an enterprise level. Cloud providers and the data systems themselves have several components to address this requirement.
To leverage cloud services as core components of a data catalog, the catalog itself must be explicitly designed around the agility, security and flexibility those services provide.
Many catalogs may provide connectivity or extensibility to cloud services, but bolted-on connectors inevitably introduce security holes, functional limitations and additional maintenance points. Even if such a catalog does work today, cloud services are evolving so rapidly that there is no guarantee it will work tomorrow when an API call changes syntax. In most cases, it is much easier to bring external systems into the cloud than vice versa.
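One hedged way to soften that churn, sketched below with hypothetical class names, is to keep every cloud SDK call behind a thin internal interface, so a changed API is fixed in one adapter rather than throughout the catalog.

```python
# A sketch of isolating cloud API calls behind a small internal
# interface; ObjectStore and S3Store are hypothetical names.
from abc import ABC, abstractmethod

class ObjectStore(ABC):
    """The only storage surface the rest of the catalog sees."""

    @abstractmethod
    def list_objects(self, prefix: str) -> list:
        ...

class S3Store(ObjectStore):
    def __init__(self, bucket: str):
        import boto3
        self._s3 = boto3.client("s3")
        self._bucket = bucket

    def list_objects(self, prefix: str) -> list:
        # If this SDK call ever changes, only this adapter changes.
        resp = self._s3.list_objects_v2(Bucket=self._bucket, Prefix=prefix)
        return [obj["Key"] for obj in resp.get("Contents", [])]
```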
Value: By natively leveraging cloud components, data catalogs can take advantage of the power and agility of the ecosystem, without introducing cumbersome, unstable or complex workarounds.
Although it is important to consider native compatibility when choosing a data catalog vendor, it is just as important to consider how well the solution will “play” with external systems. As much as any Hadoop or cloud-first advocate may hate to admit it, any enterprise-scale infrastructure will include many more systems than its native components.
NoSQL, RDBMS and traditional NFS-style systems abound in real-world infrastructure. Without connectivity to these types of systems, a data catalog will only ever capture a fraction of its potential audience and will always miss out on some of the insights a consolidated view can generate.
Fortunately, as long as our catalog is source-agnostic, a plethora of tools exists to cover ingestion and export from sources ranging from field sensors to relational databases. Tools like Sqoop, NiFi, Kafka, Amazon Kinesis and Azure Data Factory are well-established and rapidly evolving to keep up with source systems.
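To make that concrete, here is a hedged sketch of one such ingestion path, pulling a relational table into the lake with Spark's JDBC reader; the connection URL, credentials, table and output path are all placeholders, and Sqoop or NiFi could fill the same role.

```python
# A sketch of RDBMS-to-lake ingestion via Spark's JDBC reader;
# the URL, credentials, table, and output path are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdbms-ingest").getOrCreate()

customers = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db.example.com:5432/crm")
    .option("dbtable", "public.customers")
    .option("user", "catalog_reader")
    .option("password", "***")
    .load()
)

# Land the table in the lake where the catalog can index it.
customers.write.mode("overwrite").parquet("hdfs:///data-lake/crm/customers")
```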
Value: A single storage system or source will never hold 100% of your data assets; to create a complete picture, and therefore the most effective and valuable catalog, external systems should be easily connected.
If there is one constant in data architecture today, it is change. Hadoop upended our day-to-day first, and cloud technologies are proving even more dynamic. Plus, consider all the peripheral technologies that make up enterprise infrastructures.
For a catalog to be successful, a measure of future-proofing is necessary. While no software platform can predict the future (although certain AI tools may be getting close), a solid framework of stable, public-facing APIs can go a long way towards achieving this goal.
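As a minimal illustration, the sketch below exposes a versioned catalog endpoint with Flask; the /api/v1 route and payload shape are hypothetical, but the idea holds: clients coded against v1 keep working while a v2 evolves beside it.

```python
# A sketch of a versioned, public-facing catalog API; the route
# and payload shape are hypothetical, not any vendor's API.
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/api/v1/datasets", methods=["GET"])
def list_datasets_v1():
    # A stable contract: clients built against this route keep
    # working even as the implementation behind it changes.
    return jsonify([
        {"name": "sales/2017/q1", "format": "parquet"},
        {"name": "crm/customers", "format": "parquet"},
    ])

if __name__ == "__main__":
    app.run(port=8080)
```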
Technology isn’t the only thing that changes, and as your business grows, expands and refines its needs and processes, the data catalog needs to be able to scale out organically as well. This means flexibility in usage and functionality is key.
Value: A catalog that is ready for changes in technology and scope will mean more ROI and less headache as your business inevitably changes.
Building a data catalog can be a long, complex, difficult process, and the 5 guidelines above are just a starting point. Even as new products, vendors and services arise daily, following these guidelines can help you navigate the muddy waters of data lakes, especially in the cloud, and increase your chances of a successful launch.
In the end, the choice of a data platform comes down to what fits your organization. As long as thought and planning have gone into that choice, you can stop worrying about the many ways a data catalog can go wrong and start focusing on how to make it work for your business.