Since the rise of the Data Lake and related Big Data technologies (such as Hadoop, Spark, and Hive), many organizations have seen technical benefits from augmenting their traditional data warehouses with these newer technologies, which let them process more exotic kinds of data than relational tables and implement data processing pipelines too complex for traditional relational algebra and SQL. Data Lakes are hard to implement successfully not because of the technologies they use, but because of how many existing processes and organizations they disrupt. In this first part of the series, learn about a data mesh architecture that offers an iterative way to set up a data infrastructure.
From a technological perspective, this has been a great success; yet in implementing Data Lakes in the enterprise, we as an industry have seen mixed results. Augmenting a Data Warehouse with non-relational Big Data tools such as Hadoop has been difficult for many teams. Zaloni often partners with organizations to provide the technology and expertise to build a Data Lake correctly. In this blog, I’ll cover observations on where Data Lakes falter and some architectural changes you can make to avoid those pitfalls.
Data Lakes aim to pull all unprocessed data into one unified system. Initially, ingestion pipelines are set up to gather disparate data from various sources – existing data warehouses, OLTP-based applications, streaming queues, files, and logs – to be stored on a large, cheap, and distributed storage system such as HDFS or S3. This environment enables users to explore and write analytics jobs that drive better insights from the data. I often call this “collect now, process whenever.” Having cheap storage is key to this approach. This type of data hoarding is not a bad practice – after all, many novel insights are found when you can review large amounts of historical data or when you can combine data from seemingly unrelated sources.
Day-2 operations in a Data Lake
The problem with such a model is that keeping data updated can be challenging. For example, if you took a large data extract from a data warehouse (such as a Teradata DW) and put it on your Data Lake (HDFS), you still need to update the HDFS files periodically to capture changes. There are many technologies that can help you achieve this (GoldenGate, Debezium, Hudi, Delta Lake). Still, they do not solve one core problem: if you run an update process every day to pull data from the warehouse into the lake, who pays for the associated cost? The data warehouse must run retrieval queries, data flows through the network, and, most importantly, those compute cycles could be spent by the data warehouse team on other work, such as nightly batch jobs, deduplication, and index optimization.
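The core of what these CDC tools automate is an incremental "upsert" merge: apply an ordered feed of change events to yesterday's extract instead of re-copying the whole table. The sketch below illustrates that idea in plain Python with hypothetical record shapes; it is not the API of any of the tools named above.

```python
# Minimal sketch of the upsert merge that CDC tooling performs when
# applying warehouse changes to lake storage. The event shape
# ({"op": ..., "key": ..., "row": ...}) is illustrative, not a real API.

def apply_changes(snapshot, change_log):
    """Apply ordered CDC events (insert/update/delete) to a keyed snapshot."""
    table = dict(snapshot)  # key -> row; copy so the input stays intact
    for event in change_log:
        key = event["key"]
        if event["op"] in ("insert", "update"):
            table[key] = event["row"]
        elif event["op"] == "delete":
            table.pop(key, None)
    return table

# Yesterday's extract plus today's change feed:
snapshot = {1: {"name": "alice"}, 2: {"name": "bob"}}
changes = [
    {"op": "update", "key": 2, "row": {"name": "bobby"}},
    {"op": "insert", "key": 3, "row": {"name": "carol"}},
    {"op": "delete", "key": 1},
]
updated = apply_changes(snapshot, changes)
```

Note that even this cheap-looking merge still requires the warehouse side to produce the change feed every day, which is exactly the cost question raised above.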
In such cases, if the team or organization managing the Data Warehouse is different from the one managing the Data Lake (yours), there is minimal incentive for the DW team to let you pull data from their warehouse. And if you intend to “collect now” and “process whenever” (as I described above), you may not have a solid use case to justify the costs of collecting the data. This friction between systems and teams often creates resistance to sharing data and stalls many Data Lake transformation projects. Only organizations with a highly aligned vision can absorb the costs of setting up these systems in the hope of an eventual payoff.
As you can see, this isn’t a problem of using the right tool for the wrong job; it’s quite the opposite. Some newer software tools have tried to address these very issues by changing the cost structure of ingestion. For example, if a team uses S3 on Amazon Web Services to store data, they can configure the bucket so that the account accessing the data pays for data transfer costs, instead of the account that owns the S3 bucket. Many tools such as Databricks, Snowflake, and Zaloni’s Arena can take advantage of this. Setting up chargeback is a very effective way to solve this problem, and service-API-driven organizational models work well with it, too.
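The S3 feature in question is "Requester Pays." A sketch of enabling it with boto3 is below; the bucket name is a placeholder, and the calls assume AWS credentials are configured, so this is illustrative rather than something to run as-is.

```python
# Sketch: flip an S3 bucket to Requester Pays so that the account
# reading the data, not the bucket owner, pays the transfer costs.
# "producer-team-bucket" is a placeholder bucket name.
import boto3

s3 = boto3.client("s3")

# Owning account enables Requester Pays on its bucket:
s3.put_bucket_request_payment(
    Bucket="producer-team-bucket",
    RequestPaymentConfiguration={"Payer": "Requester"},
)

# Consuming account must then acknowledge it will pay on each read:
obj = s3.get_object(
    Bucket="producer-team-bucket",
    Key="extracts/orders.parquet",
    RequestPayer="requester",
)
```

The explicit `RequestPayer="requester"` acknowledgment on reads is what makes the chargeback enforceable: a consumer who omits it gets an access denied error rather than a surprise bill for the owner.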
Chargeback, however, does not work well in two places:
(1) It doesn’t work well on traditional data warehouses when you want to move very large amounts of data rather than access it through a REST API.
(2) Even if you have a chargeback model in place, expensive data onboarding exercises have large upfront costs before you see insights from that data. Migrating from decades of legacy systems and resistance from legacy culture are difficult problems to solve when you don’t have immediate benefits to show for all the effort of setting up a Data Lake.
Data Meshes: Start small, grow organically
Let’s describe a different way of proceeding with such projects: a process that lets us move iteratively and access diverse data without paying upfront to ingest it. This architecture is called a Data Mesh.
The great folks at ThoughtWorks describe this process in some detail, but I’ll attempt to summarize the paradigm’s key points:
- Create a distributed mesh of data products.
- Organize teams around the Data Domain, not tooling.
- Use a Data Platform that melts into the background.
Create a distributed mesh of data products
The first key thing to remember is that your current Data Processing solution does not need to be replaced with a shiny new Data Mesh. Instead, set up a unified catalog of all the data products in your organization.
There are many existing Data Catalogs on the market, most of which would be a great choice for setting up a Data Mesh. Your intention should be to make various datasets (in data warehouses, data lakes, queues, and all other varied sources) easy to discover. Data Catalogs tie together various data products across the organization without moving the data or paying to physically centralize it. This also makes them a very good solution for organizations with legacy systems: you get a unified view of the data without making a large investment.
There are two critical factors to keep in mind while setting up a data catalog:
- Although the catalog centralizes just the names of various datasets and makes them easy to find, it must also make the underlying data easy to access when you need it. Therefore, systems with features such as data sandboxing, provisioning, and data access requests are preferred.
- The amount of effort required to include a data source in the catalog should be invariant of the data’s size. Whether it’s a tiny text file or a multi-petabyte Hive table, it should take only minutes to add it to the catalog.
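Both factors follow from the catalog storing pointers, not bytes. A minimal sketch of that idea, with hypothetical field names, assuming each entry records only where the data lives and who owns it:

```python
# Hedged sketch of a data catalog entry: the catalog records *where*
# data lives, not the data itself, so registering a petabyte Hive table
# costs the same as registering a tiny text file. Field names are
# illustrative, not those of any specific catalog product.

catalog = {}

def register_dataset(name, location, fmt, owner):
    """Register a dataset by pointer; no bytes are moved or copied."""
    catalog[name] = {"location": location, "format": fmt, "owner": owner}

# Registering sources of wildly different sizes takes the same effort:
register_dataset("clickstream_raw", "s3://logs/clickstream/", "json", "web-team")
register_dataset("orders", "teradata://dw/sales.orders", "table", "dw-team")
```

Access features such as provisioning and sandboxing can then be layered on top of these pointers, granting access in place rather than copying data out.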
Organize teams around the Data Domain, not tooling
Modern Data Lakes and Data Warehouses require a team with specialized knowledge, which is not easy to procure. I can attest to that, having spent many years training data teams to look beyond SQL to use newer Big Data capable tools.
However, with too much focus on tools, we make the error of organizing data engineering teams around them. It is not a pretty sight to see ETL engineering teams who specialize only in data ingestion into S3 or data export into reports. This focus on tools prevents us from building institutional knowledge about the data we process; instead, only a select few people in the organization understand the structure and intent of the data. We have moved away from domain-oriented data ownership toward centralized, domain-agnostic data ownership.
This type of organization is an anti-pattern. Instead, let’s take a leaf from Domain-Driven Design, which we have seen applied successfully in enterprise microservice architectures. If we organize teams around data domains, we tend to foster a richer understanding of datasets and what they mean.
Setting up a Data Catalog helps in this endeavor. A Data Catalog gives a team a 30,000-foot view of all the places the same domain of data is used, from OLTP systems to S3 archives.
At this point you might ask: if I organize teams with domains in mind, who will manage the ingestion and transformation tools? Who will specialize in writing my MapReduce jobs?
There are two answers to this:
- Instead of a single cross-domain data team, create polyglot teams whose members can work across ingestion, Spark, and Kafka.
- Consider investing in tooling that doesn’t require a lot of low-level programming to implement patterns that are, by now, decades old: ingestion, transformation, and loading.
Use a Data Platform that melts into the background
Once you have a Data Catalog with a comprehensive view of the available datasets and you’ve organized your teams to manage domains within this data, the next step is to invest in a platform that does not require a lot of cajoling to work correctly.
Consider, for a minute, the goals of your Data Engineering organization: would you want to invest time building a data platform, a Hadoop cluster? Is that where your organization’s core competency should lie? Or should you focus instead on the data domain?
The rise of many modern, scalable data platforms as a service has made it much easier to set up data engineering platforms than in the early days of Big Data. I especially recommend tools that run in a web browser, have easy-to-use REST APIs so they can be extended, and leverage the cloud (public or otherwise) to take advantage of its flexible cost structure.
Rolling out an effective Data Platform is difficult because of existing legacy systems, processes, and groupthink. Instead of attempting to replace these, the Data Mesh allows you to enmesh (pun intended) legacy and new systems to get a complete view of your data assets and organize teams that improve institutional knowledge of data domains.
Tell us what you think about these kinds of architectural approaches or questions you have about data lakes, data mesh or DataOps. Reach out to us on Facebook or LinkedIn!
(Zaloni’s Arena can help your organization set up a Data Mesh. In my next post, I will describe an iterative plan to use Arena to set up a Data Mesh.)