Spring Cloud Data Flow
Spring Cloud Data Flow (SCDF) is an open-source, Java-based, cloud-native toolkit developed by Pivotal (VMware) to orchestrate data integration, real-time data streaming, and batch data processing pipelines. It stitches together Spring Boot microservices that can be deployed on modern runtimes such as Cloud Foundry, Kubernetes, YARN, and Mesos, in addition to a local runtime. Data pipelines deployed using Spring Cloud Data Flow consist of Spring Boot apps built with the Spring Cloud Stream or Spring Cloud Task microservice frameworks.
Spring Cloud Data Flow was designed to be the single replacement for the question mark (?) shown above. It’s a single toolkit that developers can use to create, orchestrate, and refactor data pipelines through one programming model, addressing common use cases such as data ingestion, real-time analytics, and data export/import across popular source and destination systems. SCDF lets developers define and deploy data pipelines through multiple endpoints, such as:
- Dashboard GUI (which allows defining a pipeline through a drag-and-drop palette)
- Command-Line Shell
- Stream Java DSL
- RESTful APIs
Features of Spring Cloud Data Flow
- Orchestrate applications across a variety of distributed runtime platforms, including Cloud Foundry, Apache YARN, Apache Mesos, and Kubernetes.
- Pluggable message broker binders let developers use the same application code and bind to popular messaging services such as RabbitMQ, Apache Kafka, Amazon Kinesis, Google Pub/Sub, Azure Event Hubs, or RocketMQ.
- Design, deploy, and manage data pipelines using the Java DSL, Shell, REST APIs, and Admin UI.
- The programming model offers runtime and message broker abstractions.
- Build streaming and batch applications using the configuration-driven, Spring Boot backed Spring Cloud Stream and Spring Cloud Task projects.
- Take advantage of metrics, health checks, and remote management of data-microservices.
- Standard security semantics in the form of OAuth2 and OpenID Connect backed authentication and authorization.
- Scale stream and data pipelines with almost zero downtime without interrupting the data flow.
- UI dashboards to design, deploy, and manage large-scale and compute-intensive batch data pipelines through Spring Batch jobs.
Data Flow Server
The core of the SCDF ecosystem is the Data Flow Server, a Spring Boot based microservice application that provides the main entry point for defining data pipelines in SCDF through RESTful APIs and a web dashboard. This server is responsible for parsing the stream and batch job definitions, which are written in a Domain Specific Language (DSL). The server requires a relational database to persist the metadata related to stream, task, or job definitions and to register artifacts such as additional library jar files or Docker images used in the pipeline definitions. The Data Flow Server can deploy batch jobs to one or more supported runtime platforms.
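As a rough illustration of the RESTful entry point, the sketch below builds (but does not send) an HTTP request that would ask a Data Flow Server to create a stream definition. It assumes a server listening at http://localhost:9393 and a /streams/definitions endpoint that accepts name and definition as form parameters; the stream name httptest is hypothetical, and the endpoint details should be checked against the REST API guide for the SCDF version in use.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

public class StreamDefinitionRequest {

    // Assumption: a Data Flow Server reachable at this address (hypothetical local setup).
    static final String SERVER = "http://localhost:9393";

    // Encode the stream name and its DSL text as a URL-encoded form body.
    static String formBody(String name, String definition) {
        return "name=" + URLEncoder.encode(name, StandardCharsets.UTF_8)
                + "&definition=" + URLEncoder.encode(definition, StandardCharsets.UTF_8);
    }

    // Build, without sending, a POST request that would create a stream definition.
    static HttpRequest createStream(String name, String definition) {
        return HttpRequest.newBuilder(URI.create(SERVER + "/streams/definitions"))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(formBody(name, definition)))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = createStream("httptest", "http | log");
        System.out.println(request.method() + " " + request.uri());
        System.out.println(formBody("httptest", "http | log"));
    }
}
```

Sending the request would take a java.net.http.HttpClient and a running server; the sketch stops short of that so it can be run anywhere.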
The Skipper server in the SCDF ecosystem is responsible for deploying the streaming data pipeline definitions to one or more supported runtime platforms using the Spring Cloud Deployer family of libraries. It is a Spring Boot based microservice that behaves as a package manager: it installs, upgrades, and rolls back applications on one or more runtime platforms using a blue-green deployment strategy. Just like the Data Flow Server, it also exposes RESTful APIs for the stream deployment and application management functions it offers.
Both the Data Flow Server and the Skipper server use a relational database to maintain metadata and the state of jobs, applications, and streams. Supported databases are H2, HSQLDB, MySQL, Oracle, PostgreSQL, DB2, and SQL Server.
Data Flow Shell Client
Data Flow Shell Client is an optional component that provides a command-line interface for interacting with the Data Flow Server. It helps with common functionalities of Data Flow Server such as deploying or uninstalling an app or task and creating/deploying a pipeline.
Spring Cloud Data Flow supports two types of applications.
- Short-Lived Applications: They run for a finite period of time (minutes or hours) and then terminate. The executions are triggered on a recurring schedule (such as every day at midnight) or as a response to some external event (such as a file being copied into a landing zone). These applications form parts of the batch job data pipeline that may be visualized as components of a traditional ETL setup. An example of a short-lived batch application may be a job that connects to a data warehouse like Amazon Redshift or Apache Hive and runs data profiling over the data each week.
These applications are developed using the Spring Cloud Task framework that records lifecycle events (such as the start time, end time, and the exit code) of the application into the relational database attached to the Data Flow Server. These applications can also be developed as Spring Batch jobs since Spring Cloud Task is well-integrated with it.
A short-lived application is registered with Data Flow using the category name task to describe the type of application.
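The lifecycle bookkeeping described above can be pictured with a small plain-Java sketch. This is not the Spring Cloud Task API: the TaskExecution class, the in-memory REPOSITORY list (standing in for the relational database), and the task names are all hypothetical, chosen only to show the start time, end time, and exit code being recorded around a finite unit of work.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

public class TaskLifecycleSketch {

    // One row of lifecycle metadata (hypothetical stand-in for a task-execution record).
    static final class TaskExecution {
        final String taskName;
        final Instant startTime;
        final Instant endTime;
        final int exitCode;

        TaskExecution(String taskName, Instant startTime, Instant endTime, int exitCode) {
            this.taskName = taskName;
            this.startTime = startTime;
            this.endTime = endTime;
            this.exitCode = exitCode;
        }
    }

    // In-memory stand-in for the relational database that stores the records.
    static final List<TaskExecution> REPOSITORY = new ArrayList<>();

    // Run a finite piece of work and record when it started, when it ended, and how it exited.
    static TaskExecution run(String taskName, Runnable work) {
        Instant start = Instant.now();
        int exitCode = 0;
        try {
            work.run();
        } catch (RuntimeException e) {
            exitCode = 1; // a failure is recorded rather than propagated
        }
        TaskExecution execution = new TaskExecution(taskName, start, Instant.now(), exitCode);
        REPOSITORY.add(execution);
        return execution;
    }

    public static void main(String[] args) {
        run("weekly-profiling", () -> System.out.println("profiling data..."));
        run("broken-task", () -> { throw new IllegalStateException("boom"); });
        for (TaskExecution e : REPOSITORY) {
            System.out.println(e.taskName + " exit=" + e.exitCode);
        }
    }
}
```

In a real application the framework takes care of this recording, so the task code itself contains only the business logic.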
- Long-Lived Applications: These run continuously as part of the data-streaming pipeline. A typical data-streaming pipeline operation involves consuming events from external systems, processing or transforming the data from the events, and writing to persistent storage. In SCDF, these event-streaming pipelines are generally composed of Spring Cloud Stream applications which are broadly categorized as Source, Processor and Sink applications:
- A source represents the first step in the data pipeline. It is a producer that consumes data from external systems such as databases, file systems, FTP servers, and IoT devices.
- A processor represents an application that can consume from an upstream producer (a source or another processor), perform a business operation on the consumed data, and emit the processed data for downstream consumption.
- A sink represents the final stage in the data pipeline, which can persist the consumed data to external systems such as HDFS, Cassandra, PostgreSQL, and Amazon S3.
Many source, processor, and sink applications for common use-cases (e.g. s3, jdbc, hdfs, http, router) are already developed and provided as publicly consumable pre-built applications by the Spring Cloud Data Flow team. A developer can directly use or extend any out-of-the-box utility applications to cover common use cases or write a custom application using Spring Cloud Stream.
The Spring Cloud Stream framework provides a programming model that simplifies writing message-driven microservice applications connected to a common messaging system. It lets developers write core business logic that is agnostic to the specific messaging middleware; the middleware itself is selected by adding a Spring Cloud Stream Binder library as a dependency of the application.
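To make the source, processor, and sink roles concrete, here is a minimal sketch that uses only the java.util.function types Supplier, Function, and Consumer, which is also the shape the functional style of Spring Cloud Stream builds on. Everything else is an assumption for illustration: the class and method names are hypothetical, the fixed list of events stands in for an external system, and the direct method calls in runPipeline stand in for the message broker that a binder would provide.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Supplier;

public class SourceProcessorSinkSketch {

    // Source: hands out one event per call; a fixed list stands in for an external system.
    static Supplier<String> source(List<String> events) {
        Iterator<String> it = events.iterator();
        return () -> it.hasNext() ? it.next() : null; // null signals "no more events" in this sketch
    }

    // Processor: the business logic, with no reference to any messaging middleware.
    static Function<String, String> processor() {
        return event -> event.trim().toUpperCase();
    }

    // Sink: persists the processed event; a list stands in for the target store.
    static Consumer<String> sink(List<String> store) {
        return store::add;
    }

    // Stand-in for the middleware: pass each event from source to processor to sink.
    static void runPipeline(Supplier<String> source,
                            Function<String, String> processor,
                            Consumer<String> sink) {
        String event;
        while ((event = source.get()) != null) {
            sink.accept(processor.apply(event));
        }
    }

    public static void main(String[] args) {
        List<String> store = new ArrayList<>();
        runPipeline(source(List.of(" order-1 ", "order-2")), processor(), sink(store));
        System.out.println(store);
    }
}
```

Because the three functions know nothing about how events travel between them, the same logic could in principle sit behind RabbitMQ, Kafka, or any other supported broker.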
The DSL for stream data pipeline definitions in SCDF follows the Pipes and Filters architectural pattern. This simple architectural style connects several components that process a stream of data, each connected to the next component via a pipe. The most popular adoption of the pattern is the Unix shell, and a stream data pipeline is defined using a Unix-inspired pipeline syntax. The syntax uses vertical bars, known as “pipes”, to connect multiple commands. For example, the command
% cat input.txt | grep "text" | sort > output.txt
reads input.txt and pipes it to the input of grep "text", which searches the content for the pattern “text” and passes the matching lines as input to sort, which sorts the results and writes them to the file output.txt. Each | symbol connects the standard output of the command on the left to the standard input of the command on the right.
In SCDF, the Unix command is replaced by a Spring Cloud Stream application and each pipe symbol represents connecting the input and output of applications over messaging middleware, such as RabbitMQ or Apache Kafka.
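The idea can be sketched by splitting a definition on the pipe symbol and listing which application feeds which. This is a deliberately naive illustration, not the parser SCDF uses: it assumes a definition such as http | transform | log with no application options or quoted pipe characters, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class StreamDslSketch {

    // Split a definition such as "http | transform | log" into its application names.
    static List<String> apps(String definition) {
        List<String> apps = new ArrayList<>();
        for (String part : definition.split("\\|")) {
            apps.add(part.trim());
        }
        return apps;
    }

    // Each pipe becomes a link from one application's output to the next one's input.
    static List<String> links(String definition) {
        List<String> apps = apps(definition);
        List<String> links = new ArrayList<>();
        for (int i = 0; i + 1 < apps.size(); i++) {
            links.add(apps.get(i) + " -> " + apps.get(i + 1));
        }
        return links;
    }

    public static void main(String[] args) {
        String definition = "http | transform | log";
        System.out.println(apps(definition));
        System.out.println(links(definition));
    }
}
```

In a deployed stream, each such link would be carried by a destination on the messaging middleware rather than by an in-process call.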
Applications of both types (Spring Cloud Task or Spring Cloud Stream) can be packaged in two ways:
- Spring Boot uber-jar that is hosted in a Maven repository, on the file system, or at an HTTP location
- Docker App Images
They are registered to the Spring Cloud Data Flow through the Data Flow Server.
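One way to picture the packaging options is by the URI scheme used when an application is registered. The sketch below assumes the schemes maven, file, http/https, and docker; the helper name packaging and the example coordinates are hypothetical and do not refer to real artifacts.

```java
public class AppUriSketch {

    // Classify a registration URI by its scheme (the text before the first colon).
    static String packaging(String uri) {
        int colon = uri.indexOf(':');
        String scheme = colon < 0 ? "" : uri.substring(0, colon);
        switch (scheme) {
            case "maven":
                return "uber-jar from a Maven repository";
            case "file":
                return "uber-jar from the file system";
            case "http":
            case "https":
                return "uber-jar over HTTP";
            case "docker":
                return "Docker image";
            default:
                return "unknown packaging";
        }
    }

    public static void main(String[] args) {
        // Hypothetical coordinates, for illustration only.
        System.out.println(packaging("maven://com.example:my-source:1.0.0"));
        System.out.println(packaging("docker:example/my-sink:1.0.0"));
    }
}
```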
A messaging middleware service is required to facilitate communication between applications in an SCDF pipeline. The framework provides a programming model that supports pluggable message binder libraries. The currently available binders support messaging broker services such as RabbitMQ, Apache Kafka, Amazon Kinesis, Google Pub/Sub, Azure Event Hubs, and RocketMQ.
The applications in a data pipeline require a runtime environment to execute. Some of the common runtimes that are supported out of the box are Local Server, Cloud Foundry, Apache YARN, Kubernetes, and Apache Mesos.
The following figures show the different components of the Spring Cloud Data Flow framework and how they interact with each other.
Batch Job Pipeline
Stream Data Pipeline
This article introduced Spring Cloud Data Flow, a toolkit that helps developers set up cloud-native, microservice-driven data pipelines and address common pipeline implementation challenges. We have explained the core components of the Spring Cloud Data Flow ecosystem and presented an architectural overview of how the components interact with each other at runtime. This is the first blog in a new series on Spring Cloud Data Flow. In the next blog, we’ll cover how to build an end-to-end data pipeline using Spring Cloud Data Flow and discuss the operational efficiency introduced in the pipeline by leveraging the metadata management capabilities of Arena.