Data Lakes and Real-Time Data Streaming
More than ever, streaming technologies are at the forefront of the Hadoop ecosystem. As the prevalence and volume of real-time data continues to increase, the velocity of development and change in technology will likely do the same. However, as the number and complexity of real-time data streaming technologies grow, consumers of Hadoop must face an increasing number of choices with increasingly blurred delineation of functionality.
While it would be hard to encompass all information about streaming ingest in one page (or even one book), this post is meant to provide a basic overview of the ways in which various Hadoop technologies fit into the data lake and provide a jumping-off point for further exploration. Notably, there are many, many more technologies than are shown in the diagram below — these are just a few of the most common.
The first point to make when considering streaming in the data lake is that although many of the available technologies are incredibly flexible and can be used in multiple contexts, a well-executed data lake provides strict rules and processes around ingestion. For example, Kafka and Flume both allow connections directly into Hive and HBase, and Spark can ingest and process data without ever writing to disk. This functionality is robust, but also compromises the original, unaltered data, which is a main principle of data lake architecture. Because of this, we generally restrict the ways in which data flows through the system. Data must be ingested, written to a raw landing zone where it can be held, and copied to another zone for processing and enrichment.
Flume and Kafka are the two most well-established messaging systems in use today. An extremely simple analysis of these products is below.
Kafka is the newer of the two technologies, but is quickly gaining traction as a robust, scalable and fault-tolerant messaging system. Whereas Flume can be thought of as a pipe between two points, Kafka is more of a broadcast, making data “topics” available to any subscribers who have permission to listen in. This makes Kafka, as a whole, more scalable than Flume, and also provides mechanisms for fault tolerance and redundancy of data. If one Kafka agent goes down, another will re-broadcast the topic. Where Kafka does fall short is in commercial support. Currently, Cloudera includes Kafka, but MapR and Hortonworks do not. Additionally, Kafka does not include built-in connectors to other Hadoop products. Some have been pre-written, but in general, you can’t expect the same level of “out-of-the-box” connectivity as Flume.
Flume has historically been the only choice for streaming ingest and as such, is well-established in the Hadoop ecosystem and is supported in all commercial Hadoop distributions. For large, enterprise-wide Hadoop deployments, this is an attractive, or even essential feature and may be reason enough to choose it. Despite its age, Flume has largely stayed fresh with emerging Hadoop technologies. Flume is a push-to-client system and operates between two endpoints rather than as a broadcast for any consumer to plug into. A downside is that in, a case of a Flume failure, data will be lost, as there is no replication of events.
It’s worth noting that Kafka and Flume actually provide connectivity to each other, meaning that they are not necessarily mutually exclusive. Flume includes both a sink and a source for Kafka, and there are various documented cases of linking the two, even in large-scale, production systems. For small-scale or early-stage systems, unless there is a compelling and apparent need for both, it’s best to choose one system, based on current and expected needs.
Once you have a stream of data headed for your data lake, there are several options for getting that data into a storable, consumable form. With Flume, it’s possible to write directly to HDFS with built-in sinks. Kafka does not have any such built-in connectors. Both systems, however, can benefit from the addition of a stream-processing framework within the Hadoop ecosystem. A few of these frameworks are listed below.
Storm is a true real-time processing framework, taking in a stream as an entire “event,” rather than a series of small batches. This means that Storm has very low latency and is well-suited to data that must be ingested as a single entity. Storm has been used in production instances for the longest of the three solutions here, and has commercial support available. However, Storm does suffer from a lack of direct YARN support (it can be run on Mesos or as a Slider process on YARN), and cannot guarantee that data will be processed only once.
Spark is widely known for its in-memory processing capabilities and the Spark Streaming component operates on much of the same basis. Spark is not a truly a “real-time data streaming” system. Instead, it processes in micro-batches at defined intervals. While this introduces latency, it also ensures that data is processed reliably, and only once. And, of course, Spark Streaming interfaces seamlessly with traditional Spark processing, making development easier.
Flink is somewhat of a hybrid between Spark and Storm. While Spark is a batch framework with no true streaming support and Storm is a streaming framework with no batch support, Flink includes frameworks for both streaming and batch processing. This allows Flink to offer the low latency of Storm with the data fault tolerance of Spark, along with several user-configurable windowing and redundancy settings. The largest problem with Flink at this point is a lack of existing production deployment, as well as a lack of native commercial support from major Hadoop distributions.
Taking the next step towards real-time data streaming
This is about as brief of an overview as possible to streaming ingestion in the data lake as it stands today. Products like NiFi, Ignite Streams, Beam, Samza and countless other developing technologies are on the horizon. However, when taking the first steps towards understanding the topic of streaming, the solutions mentioned in this post cover the bulk of use cases for normal operation. Even in the simplest case, ingestion, and in particular streaming ingestion, is a messy, complex process that can quickly become unmanageable. Tools such as Zaloni’s Arena help to unify the distinct solutions, no matter which you decide are right for your business, even as you grow, scale and evolve your Hadoop ecosystem.