Fast Hive: Tez and LLAP Improvements to Improve Hive Speed

Team Zaloni, February 2nd, 2017

Before the days of Spark, there was a huge Cloudera vs Hortonworks fight over which SQL/RDBMS-style solution would win on Hadoop. Hortonworks, which had a chokehold on the Hive project, espoused what it knew, which was Hive. From the Hortonworks perspective, if Hive was not fast enough, the answer was to work on Hive and make it better.

Enter Project Stinger from Hortonworks. It delivered three phases of changes, including ORC and Tez, along with other improvements such as the cost-based optimizer for Hive. This push to be seen as an innovator in an ever-changing landscape also pushed Cloudera to create Impala.

Many think that Project Stinger and Impala are direct competitors, but that statement couldn’t be further from the truth. They both do SQL, or at least approximate what could be done in RDBMS land. They both look, act and smell like the right answer for simply pushing ETL/ELT to Hadoop with few code changes. “Look! Impala is faster!” Even after Hive threw its next counterpunch, Tez (Hindi for speed), it still couldn’t touch Impala for speed. Impala did not write MapReduce jobs to be run in batch; it operated in memory instead. This is great, but it limits Impala to operations that fit in memory. Then along came Spark, and the entire issue seemed sidelined. It’s also fun to note that, at the end of the day, both Spark and Impala leverage Hive to round out their functionality.

The pace of Hive development slowed from the furious days of 2014 at Hortonworks but never quite stopped. Cloudera and Hortonworks both adopted Spark as an alternative DAG engine. The in-memory quest at Hortonworks to make Hive even faster continued and culminated in Live Long and Process (LLAP). This blog is a quick intro to both Tez and LLAP and offers considerations for using them.

Tez Offers Improvements for Hive

Tez was initially an alternative execution engine for Hive. In reality, it could be used on its own, since it was a freestanding Apache project. The base value of Tez was that it could express jobs as more complex DAGs of tasks. Developers were no longer limited to a single map followed by a single reduce; they could build complex networks of map and reduce tasks. A set of tasks that does Map, then Reduce, then Reduce again could not previously be expressed as one job. More importantly, Tez allowed the intermediate results of a task to go directly to the next task, skipping the dreaded write-to-disk step that was so costly when processing big data.

My personal favorite was the feature that added a container decay delay. This means that an allocated YARN container does not disappear immediately. Subsequent repeated identical queries run dramatically faster because they do not incur the cost of a new container launch. Containers are reused, which allows shared access to data and lowers query latency, again because there is no container setup time. Tez included more improvements, but the core changes described above were enough to confer a dramatic improvement in Hive.
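To make the Map, Reduce, Reduce example concrete, here is a minimal Python sketch of a toy model. It is not Tez or Hadoop code. The function names (`plan_mapreduce_jobs`, `intermediate_hdfs_writes`) and the rule that each classic MapReduce job holds one map phase and at most one reduce phase are simplifying assumptions made for illustration. The model only counts the hand-offs through HDFS between chained jobs.

```python
from typing import List


def plan_mapreduce_jobs(stages: List[str]) -> List[List[str]]:
    """Split a linear pipeline into classic MapReduce jobs (toy model).

    Assumed rule: a job is one map phase followed by at most one reduce
    phase, so a reduce that follows another reduce needs a new job with
    an identity map in front of it.
    """
    jobs: List[List[str]] = []
    current: List[str] = []
    for stage in stages:
        if stage == "map":
            if current:  # close an open map-only job
                jobs.append(current)
            current = ["map"]
        elif stage == "reduce":
            if current:  # pair this reduce with the open map
                jobs.append(current + ["reduce"])
                current = []
            else:  # reduce after reduce: start a new job
                jobs.append(["identity-map", "reduce"])
        else:
            raise ValueError(f"unknown stage: {stage!r}")
    if current:
        jobs.append(current)
    return jobs


def intermediate_hdfs_writes(jobs: List[List[str]]) -> int:
    """Each job boundary is one hand-off through HDFS in this model."""
    return max(len(jobs) - 1, 0)


if __name__ == "__main__":
    pipeline = ["map", "reduce", "reduce"]

    mr_jobs = plan_mapreduce_jobs(pipeline)
    print("MapReduce jobs:", mr_jobs)
    print("HDFS hand-offs:", intermediate_hdfs_writes(mr_jobs))

    # In this model a Tez-style DAG keeps every stage inside one job,
    # so there is no job boundary to hand data across.
    dag_jobs = [pipeline]
    print("DAG jobs:", dag_jobs)
    print("HDFS hand-offs:", intermediate_hdfs_writes(dag_jobs))
```

Under these assumptions the pipeline needs two chained MapReduce jobs with one hand-off between them, while the single DAG needs none. Real engines still move data between tasks, so read the numbers only as a way to see where the extra write comes from.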

Live Long and Process (LLAP)

The newest chapter in the Hive saga, called Live Long and Process, comes as part of the Stinger.Next project. It provides a whole new architecture for Hive, including an entirely new daemon for certain (mostly short-running) queries, orchestrated by the Tez Application Master (yes, Tez requires YARN). This brings sub-second queries to Hive. So speed, again, is the main target.
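As a rough illustration of what opting a Hive session into this architecture can look like, the Python sketch below renders a few session settings as HiveQL `SET` statements. The three property names exist in Hive 2.x configuration, but the values chosen here, the `LLAP_SESSION_SETTINGS` dictionary and the `render_set_statements` helper are assumptions made for this example, not recommendations from Hortonworks. Check them against the documentation for your distribution.

```python
from typing import Dict

# Illustrative values only (assumption): adjust for your cluster.
LLAP_SESSION_SETTINGS: Dict[str, str] = {
    "hive.execution.engine": "tez",     # LLAP work is orchestrated through Tez
    "hive.execution.mode": "llap",      # run query fragments in LLAP daemons
    "hive.llap.execution.mode": "all",  # try to run the whole query in LLAP
}


def render_set_statements(settings: Dict[str, str]) -> str:
    """Render settings as HiveQL SET statements, one per line."""
    return "\n".join(f"SET {key}={value};" for key, value in settings.items())


if __name__ == "__main__":
    # Paste the printed statements at the top of a Hive session or script.
    print(render_set_statements(LLAP_SESSION_SETTINGS))
```

Keeping the settings in a dictionary makes it easy to switch a script between plain Tez containers and LLAP by changing a single value.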

My personal favorite feature included in the project is fine-grained, column-level access control for tables. But there are many more things to brag about, from transaction support to potential deployment on Apache Slider. And let’s not forget a little thing like ANSI SQL:2011 compliance.

While the stated goal was for LLAP to be done in 2015, the project has taken somewhat longer than expected. In real terms, a massive set of improvements is still being undertaken, although these improvements are available in preview mode in the latest Hortonworks Sandbox. Hive is still the go-to tech for many BI vendors and new adopters of Hadoop. It’s important that it continues to improve, for everyone’s sake.

For further reading, please enjoy the following article:

Hive Basics - Elasticsearch Integration

about the author

This team of authors from Team Zaloni provides expertise, best practices, tips and tricks, and use cases across varied topics including data governance, data catalog, dataops, observability, and much more.