March 18th, 2021
The cloud was supposed to make things simpler. Instead, our data landscapes just became more convoluted. Now, data pipelines run in hybrid and multi-cloud environments that create added challenges for data developers. A DataOps approach helps teams cut through the complexity to manage data solutions in a modern environment.
Does this sound familiar? Your organization recently started modernizing its data landscape by moving to the cloud. Outsourcing the hassle of managing infrastructure was supposed to make things easier for the data and IT teams, but in reality, they now have to deal with data stranded across multiple systems. Some departments use Microsoft Azure, others Amazon Web Services, and a few core systems remain on-premises. If you find yourself nodding along, you’re not alone.
The cloud promised simplicity but brought added complexity for many enterprises. Often, instead of replacing legacy systems, the cloud augments them. The data pipelines companies rely on now run across hybrid and multi-cloud environments, increasing the number of places they can break down. These pipelines draw data from dozens of sources across fractured landscapes. Is it any wonder that business users routinely complain about the quality and speed of data and analytics applications?
The emerging methodology of DataOps suggests a way forward. By adapting the tenets of DevOps and agile software development for the world of data, DataOps gives teams a toolkit for rigorously building and maintaining data pipelines, even in the cloud. DataOps rests on four key pillars: Continuous Integration/Continuous Delivery (CI/CD), Orchestration, Testing, and Monitoring. (See Figure 1.) The last three, in particular, help organizations deal with the unique challenge of developing data pipelines across a cloud environment.
Figure 1. The Key Components of DataOps
DataOps advocates for creating a control center. It recognizes that data landscapes, especially those with cloud components, are too complicated for people to manage efficiently by hand. Instead, it calls for teams to use an orchestration tool to coordinate data, code, and software wherever they reside. These tools connect not only to systems in the same location, but also to those on other clouds or on-premises. They let developers build pipelines that kick off jobs, provision environments, and move data across multiple locations from a single platform.
Many orchestration platforms are themselves cloud-native and use systems of APIs and agents to interface with the point tools that perform each step of the data pipeline. Although it’s possible to build your own integrations using an open-source tool such as Apache Airflow, providers such as Zaloni offer pre-built connectors that make it easier to get up and running.
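To make the control-center idea concrete, here is a minimal sketch in Python of a single entry point that runs tasks in dependency order and hands each one to a runner for its location. Everything in it is hypothetical: the task names, the three locations, and the runner functions merely stand in for the APIs and agents a real orchestration platform would call. It is not how Apache Airflow or any vendor product works internally.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each task records where it runs and what it depends on.
TASKS = {
    "extract_crm": {"location": "on_prem", "depends_on": []},
    "extract_sales": {"location": "aws", "depends_on": []},
    "load_warehouse": {"location": "azure", "depends_on": ["extract_crm", "extract_sales"]},
    "refresh_report": {"location": "azure", "depends_on": ["load_warehouse"]},
}


def make_runner(location):
    """Return a stand-in for the agent or API client that runs jobs in one location."""
    def run(task_name):
        # A real runner would trigger a remote job here; this one only reports it.
        return f"[{location}] {task_name}"
    return run


# One runner per environment (assumed names, for illustration only).
RUNNERS = {location: make_runner(location) for location in ("on_prem", "aws", "azure")}


def run_pipeline(tasks, runners):
    """Run every task after its dependencies, dispatching by location."""
    graph = {name: set(spec["depends_on"]) for name, spec in tasks.items()}
    log = []
    for name in TopologicalSorter(graph).static_order():
        runner = runners[tasks[name]["location"]]
        log.append(runner(name))
    return log


if __name__ == "__main__":
    for entry in run_pipeline(TASKS, RUNNERS):
        print(entry)
```

The point of the design is that the pipeline definition lives in one place while the work happens in three. Adding another cloud means adding a runner, not rewriting the pipeline.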
DataOps involves embedding tests at every stage of a data pipeline—both for data quality and pipeline functionality. This extra upfront work makes it far easier to diagnose problems when they arise because teams can trace issues back to the point of origin. When pipelines straddle hybrid or multi-cloud environments, the ability to locate breakdowns becomes even more essential. Without tests, developers have little to go on when things go awry.
Tests also improve end-user confidence in analytics solutions. Because they allow data teams to measure and demonstrate the reliability of the data, it's easier to coax business users away from their siloed spreadsheets and onto the enterprise systems. If the pipeline automatically catches 70% of errors before they reach the data consumer, that's 70% fewer errors to generate complaints and eat away at the data team's legitimacy among decision-makers.
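As an illustration of embedding tests at every stage, the sketch below pairs each pipeline step with a check on its output and stops at the first failure, naming the stage where the problem appeared. The stages, field names, and checks are invented for this example; a real pipeline would more likely use a dedicated data-quality framework, but the principle is the same.

```python
class StageCheckError(Exception):
    """Raised when a stage's output fails its check; records the stage name."""

    def __init__(self, stage, problem):
        super().__init__(f"stage '{stage}' failed: {problem}")
        self.stage = stage
        self.problem = problem


def parse_amounts(rows):
    # Convert amounts from strings to floats; leave missing values as None.
    return [
        {**row, "amount": float(row["amount"]) if row["amount"] is not None else None}
        for row in rows
    ]


def check_no_missing_amounts(rows):
    missing = [row["order_id"] for row in rows if row["amount"] is None]
    return f"missing amount for order_id(s) {missing}" if missing else None


def total_by_region(rows):
    totals = {}
    for row in rows:
        totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]
    return totals


def check_totals_non_negative(totals):
    negative = [region for region, total in totals.items() if total < 0]
    return f"negative total for region(s) {negative}" if negative else None


# Each entry is (stage name, step function, check on that step's output).
STAGES = [
    ("parse_amounts", parse_amounts, check_no_missing_amounts),
    ("total_by_region", total_by_region, check_totals_non_negative),
]


def run_pipeline(stages, data):
    """Run each step, test its output, and fail fast with the stage name."""
    for name, step, check in stages:
        data = step(data)
        problem = check(data)
        if problem is not None:
            raise StageCheckError(name, problem)
    return data
```

With this structure, a row that arrives without an amount halts the run at `parse_amounts` instead of surfacing later as a confusing failure in the aggregation step. That is the "trace issues back to the point of origin" benefit in miniature.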
At times, paying bills from cloud providers feels like throwing money into the void. With no way to know which applications or users drive compute costs from month to month, it's difficult to improve efficiency and avoid surprise expenses. Enter pipeline monitoring. Like orchestration tools, monitoring systems interface with every component of your data landscape, but they focus on the hardware rather than the software.
Monitoring tools record information from the machines that actually run the pipelines, whether in your data center or Amazon's. This enables chargeback. When your cloud bill arrives at the end of the month, the recorded usage gives you a way to break it down. You can track which pipelines, or even which developers, are responsible for spikes in consumption, so you know how to allocate costs internally and where to target optimization efforts. In fact, monitoring tools often pay for themselves through the cloud savings they uncover.
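A simple version of chargeback can be sketched in a few lines of Python. The example assumes the monitoring tool exports usage records with a pipeline name, a developer name, and compute hours, and that the bill is split in proportion to compute hours. Both are assumptions made for illustration; real cloud bills mix many pricing dimensions, and the record format here is hypothetical.

```python
from collections import defaultdict


def allocate_bill(usage_records, bill_total, key="pipeline"):
    """Split a bill across pipelines (or developers) by share of compute hours."""
    hours = defaultdict(float)
    for record in usage_records:
        hours[record[key]] += record["compute_hours"]

    total_hours = sum(hours.values())
    if total_hours == 0:
        return {}  # nothing ran, so there is nothing to allocate

    return {
        name: round(bill_total * used / total_hours, 2)
        for name, used in hours.items()
    }


# Hypothetical records exported by a monitoring tool for one month.
USAGE = [
    {"pipeline": "sales_daily", "developer": "ana", "compute_hours": 30.0},
    {"pipeline": "sales_daily", "developer": "ben", "compute_hours": 10.0},
    {"pipeline": "ml_features", "developer": "ben", "compute_hours": 60.0},
]

if __name__ == "__main__":
    print(allocate_bill(USAGE, 1000.0, key="pipeline"))
    print(allocate_bill(USAGE, 1000.0, key="developer"))
```

Grouping the same records by `pipeline` or by `developer` answers the two questions raised above: how to allocate costs internally, and where to look first when consumption spikes.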
Applying DataOps in the cloud helps data teams address the unique challenges of building pipelines across hybrid, cloud, and multi-cloud environments. Orchestration, testing, and monitoring (all key components of a DataOps approach to development) each solve a piece of the puzzle. Orchestration helps bring complex environments together, testing makes it easier to find out where things go wrong, and monitoring keeps cloud bills predictable. Once these capabilities are in place, you may find yourself with a sunnier disposition even when dealing with clouds.