
HCat: The Best Ways to Get Started with HCatalog

Team Zaloni, July 12th, 2018

HCatalog aka HCat

HCatalog, also called HCat, is an interesting Apache project. It is one of the few Apache projects that started as part of another project, became a project in its own right, and then returned to its original project, Apache Hive.

HCat itself is described in the documentation as “a table and storage management layer” for Hadoop. In short, HCat provides an abstraction layer for accessing data in Hive from a variety of programming languages. It exposes data defined in the Hive metastore to languages other than HQL. Classically, this has included Pig and MapReduce. When Spark burst onto the big data scene, it also gained access to HCat.

Using HCat means leveraging an abstraction layer that lets programmers focus on the task at hand, not file format issues. This is done using what is called a “SerDe”, short for serializer/deserializer. A SerDe translates a programming object into a series of bytes and back again. For those of you who are not Java programmers, it is a piece of Java code that tells HCat and Hive how to exchange information in a particular format.
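To make the serializer/deserializer idea concrete, here is a minimal sketch in plain Java. It is not Hive's actual SerDe interface: the class name DelimitedSerDeSketch and its two methods are hypothetical, and the row values are made up. It only illustrates the round trip from an in-memory row to bytes and back, using the Ctrl-A character that Hive uses by default to separate fields in text files.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified stand-in for a SerDe; NOT Hive's real SerDe interface.
public class DelimitedSerDeSketch {
    // Ctrl-A, the default field delimiter for Hive text files.
    private static final String DELIMITER = "\u0001";

    // Serialize: turn an in-memory row (a list of column values) into bytes.
    public static byte[] serialize(List<String> row) {
        return String.join(DELIMITER, row).getBytes(StandardCharsets.UTF_8);
    }

    // Deserialize: turn stored bytes back into a row of column values.
    public static List<String> deserialize(byte[] bytes) {
        String line = new String(bytes, StandardCharsets.UTF_8);
        // A limit of -1 keeps trailing empty columns instead of dropping them.
        return new ArrayList<>(Arrays.asList(line.split(DELIMITER, -1)));
    }

    public static void main(String[] args) {
        // Illustrative values only.
        List<String> row = Arrays.asList("code-1", "some description", "100", "2000");
        byte[] stored = serialize(row);
        // True when the round trip preserves every column.
        System.out.println(deserialize(stored).equals(row));
    }
}
```

A real SerDe also handles typed columns, nulls and escaping, which this sketch deliberately leaves out.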

Getting Started with HCatalog

In general, to use HCatalog you upload data to the distributed file system, define the data in Hive, and then access the data via a technology of your choice, using the appropriate HCatalog statement for the language used.

Accessing Data with HCatalog

Below are short examples of HCatalog being used with each technology:

Pig – Pig uses HCatLoader and HCatStorer. Please see the very detailed Hortonworks tutorial on use of HCat for full worked examples.

    a = LOAD 'TABLENAME1' using org.apache.hive.hcatalog.pig.HCatLoader();
    b = LOAD 'TABLENAME2' using org.apache.hive.hcatalog.pig.HCatLoader();
    c = join b by colname1, a by colname1;
    dump c;

Hive – Hive uses HCat directly so there is no need for special code. Simply define your table as you would in the Hive CLI and it will be accessible via HCat. View a Hive architecture.

MapReduce – MapReduce can also access data via HCat. A fully worked example is available here. In short, adjust your mapper, reducer and driver to use HCat.

    // Get the table schema in the mapper
    // (some releases expect context.getConfiguration() here; check your version)
    HCatSchema schema = HCatBaseInputFormat.getTableSchema(context);
    Integer var1 = Integer.valueOf(value.getString("var1", schema));

    // Define the output record schema
    List<HCatFieldSchema> columns = new ArrayList<HCatFieldSchema>(3);
    columns.add(new HCatFieldSchema("year", HCatFieldSchema.Type.INT, ""));

    // Set a field on the output record
    record.setInteger("year", schema, key.getFirstInt());

SparkSQL – Spark can leverage several languages including Scala, Python, Java and R. Of course, one can simply use Spark SQL to run native HQL commands (which interact with HCat natively).

    val a = hiveContext.sql("from data.test select country, prodID")

Spark – What if you would like to access Hive from Spark without Spark SQL? Spark code accesses the Hive metastore directly.

Interacting with HCat through WebHCat

Simply put, WebHCat is the REST API for HCatalog. This allows for all sorts of scenarios where interacting with HCatalog might be required but cannot be done using other methods. The easiest way to demonstrate WebHCat is via curl. You will notice the name “Templeton” in the URL, which is the old name for WebHCat.

    curl -s 'https://localhost:50111/templeton/v1/status'
    {"status":"ok","version":"v1"}

    curl -s 'https://localhost:50111/templeton/v1/ddl/database/default/table/sample_07?user.name=hive'
    {
      "columns": [
        { "name": "code", "type": "string" },
        { "name": "description", "type": "string" },
        { "name": "total_emp", "type": "int" },
        { "name": "salary", "type": "int" }
      ],
      "database": "default",
      "table": "sample_07"
    }
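The same describe-table request can be issued from code instead of curl. The sketch below uses only the Java standard library. The class and method names (WebHCatSketch, describeTableUrl, get) are hypothetical helpers, and the host, port, table and user are taken from the curl example above. Actually fetching the response assumes a WebHCat server is running at that address.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.util.Scanner;

// Hypothetical helper class; only the URL layout comes from the curl example.
public class WebHCatSketch {
    // Build the "describe table" URL, e.g. .../ddl/database/default/table/sample_07?user.name=hive
    public static String describeTableUrl(String baseUrl, String database, String table, String user) {
        return baseUrl + "/templeton/v1/ddl/database/" + encode(database)
                + "/table/" + encode(table) + "?user.name=" + encode(user);
    }

    // Form-style encoding; adequate for simple identifiers such as table names.
    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    // Issue a GET request and return the response body (JSON) as a string.
    public static String get(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("GET");
        try (InputStream in = conn.getInputStream();
             Scanner scanner = new Scanner(in, "UTF-8").useDelimiter("\\A")) {
            return scanner.hasNext() ? scanner.next() : "";
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        String url = describeTableUrl("https://localhost:50111", "default", "sample_07", "hive");
        System.out.println(url);
        // Only contact the server when explicitly asked to.
        if (args.length > 0 && args[0].equals("--fetch")) {
            System.out.println(get(url));
        }
    }
}
```

Keeping URL construction separate from the network call makes the request easy to inspect before it is sent.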

Connecting to Hive with HiveServer2

HiveServer2 (HS2) is a connection layer that allows client connections to Hive. It includes a TCP or HTTP based Hive service layer and, like most Hadoop services, a web interface. One of the easiest ways to connect is to use the built-in client called Beeline that comes with Hive. HS2 is the technology that allows many BI tools in the Hadoop market to make use of Hive today.

    beeline
    WARNING: Use "yarn jar" to launch YARN applications.
    Beeline version 1.2.1000.2.4.0.0-169 by Apache Hive
    beeline> !connect jdbc:hive2://localhost:10000/default
    Connecting to jdbc:hive2://localhost:10000/default
    Enter username for jdbc:hive2://localhost:10000/default: hive
    Enter password for jdbc:hive2://localhost:10000/default: ****
    Connected to: Apache Hive (version 1.2.1000.2.4.0.0-169)
    Driver: Hive JDBC (version 1.2.1000.2.4.0.0-169)
    Transaction isolation: TRANSACTION_REPEATABLE_READ
    0: jdbc:hive2://localhost:10000/default> show databases;
    +----------------+--+
    | database_name  |
    +----------------+--+
    | default        |
    | xademo         |
    +----------------+--+
    2 rows selected (2.867 seconds)
    0: jdbc:hive2://localhost:10000/default>

You can think of HS2 as a Thrift-based service allowing remote access to the Hive command line. So again, in this scenario any tables created will automatically be available via HCatalog, since you are essentially working at the Hive CLI.

    0: jdbc:hive2://localhost:10000/default> show tables;
    +------------+--+
    |  tab_name  |
    +------------+--+
    | sample_07  |
    | sample_08  |
    +------------+--+
    2 rows selected (0.381 seconds)
    0: jdbc:hive2://localhost:10000/default> create table testtable( eid int);
    No rows affected (10.69 seconds)
    0: jdbc:hive2://localhost:10000/default> show tables;
    +------------+--+
    |  tab_name  |
    +------------+--+
    | sample_07  |
    | sample_08  |
    | testtable  |
    +------------+--+
    3 rows selected (0.351 seconds)

The new table is immediately visible from the hcat command line as well:

    hcat -e "show tables;"
    WARNING: Use "yarn jar" to launch YARN applications.
    OK
    sample_07
    sample_08
    testtable
    Time taken: 8.589 seconds
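BI tools typically reach HS2 through JDBC, using the same jdbc:hive2:// URL that Beeline connects to. The following is a minimal sketch of that path using the standard java.sql API. The class and method names (HiveJdbcSketch, jdbcUrl, showTables) are hypothetical, and it assumes the Hive JDBC driver jar is on the classpath and that HS2 is listening on localhost:10000 as in the Beeline session.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; assumes the Hive JDBC driver is on the classpath.
public class HiveJdbcSketch {
    // Build a HiveServer2 JDBC URL such as jdbc:hive2://localhost:10000/default
    public static String jdbcUrl(String host, int port, String database) {
        return "jdbc:hive2://" + host + ":" + port + "/" + database;
    }

    // Run "show tables" over JDBC and collect the table names.
    public static List<String> showTables(String url, String user, String password) throws SQLException {
        List<String> tables = new ArrayList<>();
        try (Connection conn = DriverManager.getConnection(url, user, password);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("show tables")) {
            while (rs.next()) {
                tables.add(rs.getString(1)); // first column holds the table name
            }
        }
        return tables;
    }

    public static void main(String[] args) throws SQLException {
        String url = jdbcUrl("localhost", 10000, "default");
        System.out.println(url);
        // Connect only when a user name and password are supplied as arguments.
        if (args.length == 2) {
            System.out.println(showTables(url, args[0], args[1]));
        }
    }
}
```

Run against the session above, showTables would be expected to list the same tables that Beeline and hcat report.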

Learn more about HCatalog and compatible technologies

HCatalog is a way for many different technologies to share the tables defined in Hive without having to write low-level integration with the Hive metastore. Without HCatalog, reusing existing data becomes more cumbersome. Aside from the fact that Hive is the technology in Hadoop that looks and feels the most like everyone’s beloved RDBMS, HCatalog is what allows a multitude of Hadoop command line tools to interact with Hive. HS2 then provides the easiest way for a sea of BI tools to connect to Hive and leverage tables in Hive directly.


about the author

This team of authors from Team Zaloni provides expertise, best practices, tips and tricks, and use cases across varied topics including data governance, data catalog, DataOps, observability, and much more.