March 2nd, 2017
Stream data processing seems to be the next ‘big thing’ in big data. With a couple open source projects advertising streaming engines – Flink, Beam, and Apex – we decided to jump in and test one of them out for our data lake customers. Flink seems to be the most mature in the segment, having just announced its 1.0.0 release.
We want to stream stock prices and develop real-time metrics to report to the user. In this case, our metric is the 5-minute moving price average. To connect this back to a real-world use case, this is sometimes used by traders to get a sense of whether a security is currently underpriced or overpriced, albeit at larger time intervals.
One of the biggest obstacles to much of financial analysis is finding free sources of data. Fortunately, Google provides us with a URL for each security that we can hit every minute to get an updated JSON that includes the current price.
A line has been drawn in the sand between the stream processing frameworks named previously, and Spark Streaming. Behind the scenes, Spark Streaming is really batch-processing system that appears as a streaming system. It provides the illusion of stream processing by creating ‘micro-batches’ across time that are processed individually. One drawback to this is that sometimes splits up related data across batches, creating issues when running analysis on data that spans multiple batches. However, for time-agnostic use-cases, this might not be an issue at all.
When calculating moving price averages with the batch system, if one of our prices is delayed for some reason, our moving average for that timeframe would be slightly skewed. However, with a true streaming system like Flink, the order that the event is received does not matter, since we perform operations based on the event time.
The Google endpoint has a JSON object that is updated every minute with properties such as stock price, time, volume, etc. For our purposes, we only need the stock price and the time. At this point, we have a couple options as to how we want to send this over to Flink. Flink has a couple of built-in ‘stream connectors’ that include Kafka, Flume, Twitter, as well as being able to receive data from a specific port. We went with Kafka, but another option would be to push new data to a Netcat process at a specific port.
The Kafka program polls the specified Google endpoint every minute and reads the JSON. It then pulls out the fields that we need, concatenates them with a comma delimiter, and pushes that out to a topic. For multiple stock tickers, we could add a key that would allow Kafka to partition the topic based on the ticker name (‘GOOG, ‘AAPL’, etc.).
Flink is watching this Kafka topic, so once the values are pushed in, we can receive them and start working with them.
The above Flink code first defines a Kafka stream connector, which allows us to create a Datastream object that can be acted upon by Flink transformations. The first transformation splits up the received records into time and price. We then define 5-minute windows, over which we take the average of the prices that we count. We then return this value to the user.
The above exercise was useful in getting familiar with the mechanics of stream processing using Flink. At Zaloni, we are seeing a growing number of business use-cases that call for low-latency processing of unbounded data sets, and so we expect to see more widespread adoption of stream-processing frameworks. Maturity, or lack of, might have been a criticism in the past – that seems to no longer be an issue.