Elasticsearch is an increasingly powerful, useful, and widely used tool in big data, and for good reason. It is a flexible, feature-rich platform for indexing, searching, and storing data at scale, and provides a slick foundation for user-facing UIs. Adding Kibana provides another level of visualization and analytics, and the other applications in the ELK stack add further functionality and value.
One perpetual weakness of ELK, however, is the need to store all data within Elasticsearch. Although Logstash is a robust interface for data ingestion, creating config files and mappings for each data source quickly becomes cumbersome. With the ubiquity of Hive among Hadoop systems, a natural solution is to extend the existing external table structures to allow for integration with Elasticsearch, and this is exactly what the Elasticsearch-Hadoop package does. Because Hive can act as either a data store or simply a metadata catalog for external stores, this leads to a powerful management system.
Sample Architecture Using Elasticsearch as a Datastore
By using the Hive connector for Elasticsearch, we can create external Hive tables that use Elasticsearch as a datastore as if it were HDFS. This means that as we add data or tables to Hive, indices are transparently created to manage this data. They will be indexed, analyzed, and optimized just like any other Elasticsearch index, with the added benefit of opening up SQL-style queries through Hive. The resulting architecture will look like this:
This architecture is desirable for a few reasons:
- Simplicity. By eliminating the need to ingest data twice, there is less chance for error, less overhead, and less complexity in troubleshooting.
- Transparency. Once a table is defined in Hive, any application can execute SQL queries on the data contained in the table, regardless of the storage layer, and furthermore, this model allows Hive to truly act as a primary metastore.
- Scalability. Often, best practice is to hold back on the indexing of petabytes of data in Elasticsearch unless it is 100% necessary; in this setup, Hive can handle the bulk of the data, while specific tables or even fields can be indexed.
Concerns with the Hive Connector for Elasticsearch
Of course, no system exists without tradeoffs. In particular, a few concerns pop up with Hive-Elasticsearch integration.
- Existing tables need to be copied to new tables using:
CREATE EXTERNAL TABLE elastic_table(column_list)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('…');
INSERT OVERWRITE TABLE elastic_table SELECT * FROM hive_table;
This process can be lengthy for large tables, and its duration will also depend on the latency between the Hive and Elasticsearch nodes. The schema of the new table must match the existing one exactly; this is easy enough to pull from a DESCRIBE statement, but it requires some thought.
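One way to pull the existing schema, sketched here with a hypothetical table name (`hive_table`), is to let Hive print it for you:

```sql
-- List the columns and types of the existing table, so the
-- Elasticsearch-backed table's schema can match it exactly.
DESCRIBE hive_table;

-- Prints the full DDL for the table, which is often easier to adapt
-- into the CREATE EXTERNAL TABLE ... STORED BY statement.
SHOW CREATE TABLE hive_table;
```

Either statement runs in Beeline or the Hive CLI; SHOW CREATE TABLE saves you from retyping long column lists by hand.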
- In order to truly streamline the process of adding data from Hive to Elasticsearch, the es.index.auto.create flag must be used. However, this means that Elasticsearch will read the Hive data types and use them to create mappings. Usually this is fine, but when more complex data types are needed, such as geo_points or objects, manual mappings will be required.
- If Hadoop-level redundancy is required, data will need to be maintained in both Elasticsearch and Hive. Elasticsearch replicates its data, but it cannot meet the same replication and recovery SLAs as HDFS.
- Dropping an Elasticsearch-managed table from Hive will not drop the corresponding data from Elasticsearch.
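As a sketch of the manual-mapping case mentioned above: a geo_point field cannot be auto-created from a Hive type, so the index would be created ahead of time with an explicit mapping. The index, type, and field names below are hypothetical, and the URL assumes a local pre-7.x cluster (matching the index/type form of es.resource used throughout this post):

```shell
# Create the index with an explicit mapping before inserting from Hive,
# so the "location" field is indexed as a geo_point rather than a string.
curl -XPUT 'http://localhost:9200/elastic_index' \
  -H 'Content-Type: application/json' -d '{
  "mappings": {
    "elastic_type": {
      "properties": {
        "location": { "type": "geo_point" }
      }
    }
  }
}'
```

With the mapping in place, es.index.auto.create is no longer needed for this index; the Hive INSERT simply writes documents into the pre-built structure.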
We’ll dig deeper into the details and tradeoffs in another post, but for now, let’s look quickly at the relatively simple requirements for setting up the Elasticsearch-Hadoop connector.
- Download the appropriate version of Elasticsearch-Hadoop for your ES environment from https://www.elastic.co/downloads/past-releases. Add the JAR file to your Hive aux-lib path, or add it to the session via ADD JAR in the Hive CLI.
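If you take the ADD JAR route, the statement looks like the sketch below; the path and version number are placeholders for wherever you unpacked the download:

```sql
-- Registers the connector for the current Hive session only.
-- Path and version are placeholders for your actual download location.
ADD JAR /path/to/elasticsearch-hadoop-<version>.jar;
```

Note that ADD JAR only affects the current session, while the aux-lib path makes the connector available to every session without the extra statement.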
- Restart HiveServer2.
- Make sure your ES cluster is up and accessible from the Hive node.
- In Beeline or the Hive CLI, execute the following statements:
CREATE EXTERNAL TABLE elastic_table(column_list)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES(
  'es.resource'='index/type',
  'es.nodes'='serverIP:port',
  'es.index.auto.create'='TRUE'
);
INSERT OVERWRITE TABLE elastic_table SELECT * FROM hive_table;
- Check Elasticsearch to make sure that the index and type were created and that documents exist.
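That check can be done from the command line; the sketch below assumes a local cluster, with the index and type names matching the es.resource value used in the table definition:

```shell
# Confirm the index exists and inspect the mappings that were auto-created.
curl 'http://localhost:9200/index/_mapping?pretty'

# Confirm documents made it in; the count should match the Hive source table.
curl 'http://localhost:9200/index/type/_count?pretty'
```

Comparing the _count result against a SELECT COUNT(*) on the source Hive table is a quick sanity check that the INSERT OVERWRITE moved everything.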
If your index was created and populated successfully, then congratulations! You just created an Elasticsearch index using Hive.
With this setup, you can visualize the data using Kibana, search it using ES or other search tools, and use it just like you’d use any other index. Make sure to check the fields in the index to verify that the types and properties were correctly generated; for example, numeric fields can sometimes be read as strings, or a field might be marked as analyzed when it should not be. Even if there are some errors, though, I’ll gladly take a little bit of tweaking over manual creation of mappings any day.