November 9th, 2017
Elasticsearch (ES) is a powerful search platform with a host of other features and capabilities, especially when paired with Hadoop via the Elasticsearch-Hadoop connector. One interesting aspect of Elasticsearch is storage; ES uses a replication and sharding strategy to obtain data resiliency, which is similar to HDFS on first glance.
This raises the question: could Elasticsearch be considered a primary datastore for at least some of the data in a data lake environment? This would certainly be convenient, especially considering the ease by which Hive can create external ES tables. As is the case so often in Hadoop, we’ll see that the answer is muddy at best, depending heavily on the applications and use cases considered.
Before we dive into why using Elasticsearch to store data may or may not be a good idea, let’s first look at how ES stores data, and some of the differences between ES and HDFS. As mentioned earlier, ES uses a replica/shard system wherein data (in this case, an index) is sharded into a set number of distinct groups, each being replicated a specific number of times. For example, consider a cluster with 2 data nodes and 2 indices, where index 1 has a replication factor of 0 with 2 shards, and index 2 has a replication factor of 1 with 2 shards.
There are a few main points to notice:
So far, so good! Elasticsearch looks to have most of the durability features we expect out of HDFS. One main difference in this paradigm, though, is that this redundancy is document based, which means that we do not have any guarantee of file based redundancy. Losing a shard could mean losing an entire index, or a single file, or a mix of multiple files. The same could technically be said of HDFS, which replicates data in blocks, but the analogy is one step further, since HDFS also has awareness of file metadata. Elasticsearch also has metadata, but relating mostly to index details, such as creation time and index mappings.
One potential way to approach this difference, if it is an issue, is to use types within an index; although this does not necessarily add extra resiliency, it does at a contextual layer to the document-based store, allowing logical separation of different tables or files. This is how Hive manages the Elasticsearch connection- each table is essentially given an index and a type. Still, if table or file persistence is important, Elasticsearch may not fit.
Another important distinction comes in sizing and scaling of clusters. While HDFS emphasized the use of low-cost, high-density storage nodes, Elasticsearch is slightly more demanding on the (relatively) costly compute side. Generally, enterprise-scale Elasticsearch nodes will demand between 10:1 and 15:1 storage-to-compute ratio; that is, if a cluster is sized to handle 50TB of data, it will need between 3TB and 5TB of RAM. Compared with a similar HDFS system, which might require 50GB-100GB of RAM for the same storage space, this is clearly costly.
The difference is that Elasticsearch is doing much more than just storing the data; it is analyzing it, optimizing it, and providing near real-time results to search queries. A more apt comparison than ES-to-HDFS might be to consider HDFS, Spark, Hive, and Ambari; even so, this combined system would still consume less RAM, and we’re talking about ES as a storage option here.
Realistically, Elasticsearch does not scale as a primary storage means for peta-scale environments, but it is unrealistic to expect it to do so. What is more realistic is the selective indexing of high-value data through a system like Hive, which can then be used to execute SQL queries, interact with Spark, and expose the data to Hadoop while maintaining all of the benefits of Elasticsearch.
This type of architecture lets each system do what it excels at; Hive acts as the metastore and bridge between the classic Hadoop ecosystem and the extended universe of Elasticsearch, while Elasticsearch can leverage data natively, leveraging Kibana and other 3rd party apps to add additional value. The table below summarizes the key differences between ES and HDFS.
So, what’s the bottom line? Elasticsearch is a fully capable data store with many of the resiliency features of HDFS underlying its robust search functionality. This comes at the cost of high memory requirements, potential latency issues when updating data, and a document-based taxonomy that loses most file and table context. Although this means that Elasticsearch probably shouldn’t store all your data, that doesn’t mean it’s not suited to store some of your data.
Hive provides an easy-to-use bridge between Hadoop and Elasticsearch, and the robustness behind the data redundancy means you don’t have to give up much by choosing to store exclusively on Elasticsearch and not duplicating data locally on HDFS. Particularly for data that will not ever need to be processed in tools other than Hive or Hive-Compatible tools. For example, log data that will mostly be used for end-user visualization (i.e., in Kibana), real-time feeds that can be discarded or offloaded after a few days, and other data that tends to be high-impact without much massaging are all good candidates for direct Elasticsearch storage. Who knows- given the rapid drops in cost of technology, Elasticsearch may one day be a viable solution for whole-data lake storage (of course, by then, we’ll all be producing zettabytes of data every day from our flying, self-driving cars).