November 12th, 2020
Spring Cloud Data Flow (SCDF) is an open-source, Java-based, cloud-native toolkit developed by Pivotal (now part of VMware) to orchestrate data integration, real-time data streaming, and batch data processing pipelines by stitching together Spring Boot microservices that can be deployed on top of modern runtimes such as Cloud Foundry, Kubernetes, YARN, and Mesos, in addition to a local runtime. Data pipelines deployed using Spring Cloud Data Flow consist of Spring Boot apps built with the Spring Cloud Stream or Spring Cloud Task microservice frameworks.
Spring Cloud Data Flow was designed to be the single replacement for the question mark (?) shown above. It is one toolkit that developers can employ to create, orchestrate, and refactor data pipelines through a single programming model, addressing common use cases such as data ingestion, real-time analytics, and data export/import across popular source and destination systems. SCDF lets developers define and deploy data pipelines through several interfaces, including RESTful APIs, an interactive web dashboard, and a command-line shell.
The core of the SCDF ecosystem is the Data Flow Server, a Spring Boot-based microservice application that provides the main entry point for defining data pipelines in SCDF through RESTful APIs and a web dashboard. The server is responsible for parsing stream and batch job definitions, which are expressed in a Domain-Specific Language (DSL). It requires a relational database to persist metadata related to stream, task, and job definitions, and to register artifacts such as additional library jar files or Docker images used in pipeline definitions. The Data Flow Server can deploy batch jobs to one or more supported runtime platforms.
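As a sketch of that entry point, the RESTful API can be exercised directly with curl. The commands below assume a Data Flow Server listening on its default port 9393; the stream name and DSL string are illustrative:

```shell
# Ask the server about itself (version, enabled features)
curl http://localhost:9393/about

# Create (but do not yet deploy) a stream definition from a DSL string
curl -X POST http://localhost:9393/streams/definitions \
  -d "name=http-ingest" \
  -d "definition=http | log" \
  -d "deploy=false"
```

The same operations are available through the web dashboard and the shell, which are thin clients over this API.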
The Skipper server in the SCDF ecosystem is responsible for deploying streaming data pipeline definitions to one or more supported runtime platforms using the Spring Cloud Deployer family of libraries. It is a Spring Boot-based microservice that acts as a package manager, installing, upgrading, and rolling back applications on those platforms using a blue-green deployment strategy. Like the Data Flow Server, it exposes RESTful APIs for stream deployment and application management.
Both the Data Flow Server and the Skipper server use a relational database to maintain metadata and the state of jobs, applications, and streams. Supported databases are H2, HSQLDB, MySQL, Oracle, PostgreSQL, DB2, and SQL Server.
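Because both servers are ordinary Spring Boot applications, the database connection is configured through standard Spring datasource properties; the values below are a hypothetical MySQL setup (without such configuration, the server falls back to an embedded H2 database):

```properties
# Hypothetical datasource settings for the Data Flow Server
spring.datasource.url=jdbc:mysql://localhost:3306/dataflow
spring.datasource.username=dataflow
spring.datasource.password=secret
spring.datasource.driver-class-name=com.mysql.cj.jdbc.Driver
```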
The Data Flow Shell client is an optional component that provides a command-line interface for interacting with the Data Flow Server. It covers common Data Flow Server functionality such as deploying or uninstalling an app or task and creating or deploying a pipeline.
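A minimal shell session might look like the following; the jar version and server URL are illustrative, so substitute the release and host you actually use:

```shell
# Start the shell (downloaded separately as an executable jar)
java -jar spring-cloud-dataflow-shell-2.9.2.jar

# Inside the shell, point it at a running Data Flow Server
dataflow:> dataflow config server --uri http://localhost:9393
```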
Spring Cloud Data Flow supports two types of applications: short-lived tasks and long-lived streaming applications.
Short-lived applications are developed using the Spring Cloud Task framework, which records lifecycle events of the application (such as the start time, end time, and exit code) in the relational database attached to the Data Flow Server. These applications can also be developed as Spring Batch jobs, since Spring Cloud Task integrates well with Spring Batch.
A short-lived application is registered with Data Flow under the category name task, which describes the type of application.
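To make the task category concrete, a pre-built task app can be registered and launched from the shell roughly as follows; the Maven coordinates and version shown are illustrative, so verify them against the pre-built app release you use:

```shell
# Register a pre-built short-lived app under the "task" category
dataflow:> app register --name timestamp --type task --uri maven://org.springframework.cloud.task.app:timestamp-task:2.1.1.RELEASE

# Create a task definition from it and launch it once
dataflow:> task create my-timestamp --definition "timestamp"
dataflow:> task launch my-timestamp
```

Each launch is recorded as a task execution, which is how the lifecycle events described above end up in the server's database.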
Many source, processor, and sink applications for common use cases (e.g. S3, JDBC, HDFS, HTTP, router) have already been developed and are provided as publicly consumable, pre-built applications by the Spring Cloud Data Flow team. A developer can use or extend any of these out-of-the-box applications directly, or write a custom application using Spring Cloud Stream.
The Spring Cloud Stream framework provides a programming model that simplifies writing message-driven microservice applications connected to a common messaging system. It lets developers write core business logic that is agnostic to the specific messaging middleware; the middleware-specific wiring is supplied by adding a Spring Cloud Stream Binder library as a dependency to the application.
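In practice this makes the choice of middleware a build-time decision rather than a code change. A sketch of the Maven dependency for the Kafka binder looks like this (the version is normally managed by the Spring Cloud BOM, so none is pinned here):

```xml
<!-- Binds the application's channels to Apache Kafka; swapping this artifact
     for spring-cloud-stream-binder-rabbit retargets the same code to RabbitMQ -->
<dependency>
  <groupId>org.springframework.cloud</groupId>
  <artifactId>spring-cloud-stream-binder-kafka</artifactId>
</dependency>
```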
The DSL for stream data pipeline definitions in SCDF follows the Pipes and Filters architectural pattern. This simple architectural style connects several components that each process a stream of data, with each component connected to the next via a pipe. The best-known adoption of the pattern is the Unix shell, and a stream data pipeline in SCDF is defined using a Unix-inspired pipeline syntax. The syntax uses vertical bars, known as "pipes", to connect multiple commands. For example, the command
% cat input.txt | grep "text" | sort > output.txt
reads input.txt and pipes its contents to grep "text", which searches the content for the pattern "text" and passes its output to sort, which sorts the results and writes them to the file output.txt. Each | symbol connects the standard output of the command on its left to the standard input of the command on its right.
In SCDF, each Unix command is replaced by a Spring Cloud Stream application, and each pipe symbol represents connecting the output of one application to the input of the next over messaging middleware, such as RabbitMQ or Apache Kafka.
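Carrying the analogy over, a hypothetical SCDF stream definition mirrors the shell pipeline above. Here http, filter, and log are pre-built stream applications, and the filter expression (a SpEL expression over the message payload) is illustrative:

```
http --port=9000 | filter --expression="payload.contains('text')" | log
```

Data posted to the http source flows over the message broker to the filter processor, and matching messages continue on to the log sink.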
Applications of both types (Spring Cloud Task or Spring Cloud Stream) can be packaged in two ways: as a Spring Boot uber-jar or as a Docker image.
They are registered with Spring Cloud Data Flow through the Data Flow Server.
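As a sketch of registration, the shell's app register command associates a logical name and type with a packaged artifact; both URIs below are hypothetical and show the two packaging options:

```shell
# Register an uber-jar resolved from a Maven repository
dataflow:> app register --name my-processor --type processor --uri maven://com.example:my-processor:1.0.0

# Register an app packaged as a Docker image
dataflow:> app register --name my-sink --type sink --uri docker:example/my-sink:1.0.0
```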
A messaging middleware service is required to facilitate communication between the applications in an SCDF pipeline. The framework provides a programming model with pluggable message binder libraries; the currently available binders support popular messaging brokers such as RabbitMQ and Apache Kafka.
The applications in a data pipeline require a runtime environment in which to execute. Runtimes supported out of the box include Local Server, Cloud Foundry, Apache YARN, Kubernetes, and Apache Mesos.
The following figures show the different components of the Spring Cloud Data Flow framework and how they interact with each other.
This article introduced the Spring Cloud Data Flow toolkit, which helps developers set up cloud-native, microservice-driven data pipelines and addresses common pipeline implementation challenges. We explained the core components of the Spring Cloud Data Flow ecosystem and presented an architectural overview of how those components interact at runtime. This is the first blog in a new series on Spring Cloud Data Flow. In the next blog, we'll cover how to build an end-to-end data pipeline using Spring Cloud Data Flow and discuss the operational efficiency gained by leveraging the metadata management capabilities of Arena.