Elasticsearch is a powerful and widely used tool in big data, and for good reason. It is a flexible, feature-rich platform for indexing, searching, and storing data at scale, and it provides a slick foundation for user-facing UIs. Adding Kibana brings another level of visualization and analytics, and the other applications in the ELK stack add further functionality and value.
One perpetual weakness of ELK, however, is the need to store all data within Elasticsearch. Although Logstash is a robust interface for data ingestion, creating config files and mappings for each data source quickly becomes cumbersome. With the ubiquity of Hive among Hadoop systems, a natural solution would be to extend the existing external table structures to allow for integration with Elasticsearch, and this is exactly what the Elasticsearch-Hadoop package does. Because Hive can act either as a data store or simply as a metadata catalog for external stores, the result is a powerful management system.
By using the Hive connector for Elasticsearch, we can create external Hive tables that use Elasticsearch as a datastore as if it were HDFS. This means that as we add data or tables to Hive, indices are transparently created to manage that data. They are indexed, analyzed, and optimized just like any other Elasticsearch index, with the added benefit of opening up SQL-style queries through Hive. In the resulting architecture, Hive provides the SQL interface and metadata catalog, the Elasticsearch-Hadoop connector handles reads and writes, and Elasticsearch stores and indexes the data, with Kibana available on top for visualization.
This architecture is desirable for a few reasons: data loaded through Hive becomes searchable in Elasticsearch without any Logstash configuration, the same data remains queryable with SQL through Hive, and Hive's metadata catalog keeps all of the schemas in one place.
Of course, no system exists without tradeoffs, and a few concerns pop up with Hive-Elasticsearch integration. Chief among them: existing Hive tables cannot simply be converted in place. Instead, a new external table must be created and the data copied into it:
CREATE EXTERNAL TABLE elastic_table(column_list)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('...');

INSERT OVERWRITE TABLE elastic_table SELECT * FROM hive_table;
This process can be lengthy for large tables, and its speed will also depend on network latency between the Hive and Elasticsearch nodes. The schema of the new table must match the existing one exactly; this is easy enough to pull from a DESCRIBE statement, but it requires some thought.
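For example, assuming a hypothetical source table named hive_table, the existing column list can be pulled straight from Hive and copied into the new table definition:

-- Lists each column name and type of the source table (hive_table is hypothetical)
DESCRIBE hive_table;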
We'll dig deeper into the details and tradeoffs in another post, but for now, let's look quickly at the relatively simple requirements for setting up the Elasticsearch-Hadoop connector.
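Beyond a running Elasticsearch cluster, the main requirement is making the connector jar visible to Hive. One common approach is to register it at the start of the session; the jar path below is an assumption, so adjust it for your environment:

-- Register the Elasticsearch-Hadoop connector with the current Hive session
ADD JAR /path/to/elasticsearch-hadoop.jar;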
CREATE EXTERNAL TABLE elastic_table(column_list)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES(
  'es.resource' = 'index/type',
  'es.nodes' = 'serverIP:port',
  'es.index.auto.create' = 'true');

INSERT OVERWRITE TABLE elastic_table SELECT * FROM hive_table;
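As a concrete sketch, here is that template filled in for a hypothetical web-log table; the table names, column list, index name, and node address are all assumptions:

-- Hypothetical example: back a web-log table with the 'weblogs/log' index
CREATE EXTERNAL TABLE es_weblogs(
  ts TIMESTAMP,
  url STRING,
  status INT,
  bytes BIGINT)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES(
  'es.resource' = 'weblogs/log',
  'es.nodes' = '10.0.0.5:9200',
  'es.index.auto.create' = 'true');

-- Copy the existing Hive data into the Elasticsearch-backed table
INSERT OVERWRITE TABLE es_weblogs
SELECT ts, url, status, bytes FROM weblogs;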
If your index was created and populated successfully, then congratulations! You just created an Elasticsearch index using Hive.
From here, you can visualize the data using Kibana, search it with Elasticsearch or other search tools, and use it just like you'd use any other index. Make sure to check the fields in the index to verify that the types and properties were generated correctly; for example, numeric fields can sometimes be read as strings, or a field might be marked as analyzed when it should not be. Even if there are a few errors, though, I'll gladly take a little bit of tweaking over manual creation of mappings any day.
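One quick way to spot-check the generated field types is Elasticsearch's mapping API; the host, port, and index name below are the same placeholders used in the template above:

curl -XGET 'http://serverIP:port/index/_mapping?pretty'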