The first step in working with any technology is understanding the basics. This article aims to get you familiar with the basic terminology and concepts of Apache Hive, plus an inside look at how Arena makes it easier to work with Hive.
Introduction to Hive
Apache Hive is open-source, distributed, fault-tolerant data warehousing software built on top of Apache Hadoop. It is designed to store and process large structured and semi-structured datasets and to perform analytics on them at massive scale. A data warehouse is a central store of information that can easily be accessed and analyzed to make informed, data-driven decisions. HiveQL is an SQL-like interface used to read, write, and manage petabytes of data using SQL syntax. What makes Hive unique is how it works behind the scenes: it pairs the SQL-like interface with MapReduce or Apache Tez to query large datasets. The datasets themselves are stored in a distributed file system such as the Hadoop Distributed File System (HDFS) or Amazon S3. The database and table metadata corresponding to the datasets is stored in the metastore, a database or file-backed store that enables easy data abstraction and discovery.
Components of Hive
There are two key components of Hive:
HCatalog enables users of different data processing tools, such as Pig or MapReduce, to read and write data more easily. HCatalog reads metadata from the Hive metastore and integrates seamlessly with other engines by letting Pig or MapReduce use Hive's own data structures, so no per-engine metadata changes are needed. HCatalog's table abstraction presents a relational view of the data in HDFS and ensures that users need not worry about where or in what file format their data is stored. HCatalog supports reading and writing files in any format for which a SerDe can be written.
WebHCat is a RESTful API provider for HCatalog, primarily designed to let third-party tools use HTTP requests to access and reuse Hive metadata and to run Hadoop MapReduce, Pig, or Hive jobs.
File Formats Supported in Hive
Out of the box, Hive supports the Text File, SequenceFile, CSV, JSON, Avro, RCFile, ORC, and Parquet file formats. In addition, you can use a custom format by writing a SerDe for that particular input/output format. The hive.default.fileformat configuration parameter determines the format used when none is specified in a CREATE TABLE or ALTER TABLE statement; the default value is TextFile.
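For instance, a format can be chosen per table, or the session-level default can be changed (the table and column names below are illustrative):

```sql
-- Choose a format for one table (names are illustrative)
CREATE TABLE page_views (user_id BIGINT, url STRING) STORED AS ORC;

-- Change the session default so that CREATE TABLE statements
-- without a STORED AS clause use ORC instead of TextFile
SET hive.default.fileformat=ORC;
```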
Hive currently uses these FileFormat classes to read and write HDFS files:
- TextInputFormat/HiveIgnoreKeyTextOutputFormat: These two classes read/write data in plain text file format.
- SequenceFileInputFormat/SequenceFileOutputFormat: These two classes read/write data in Hadoop SequenceFile format.
What is a SerDe?
SerDe is short for Serializer/Deserializer. Hive uses a SerDe, together with a file format, to read data from a Hive table and to write data back to the HDFS location corresponding to that table. At a high level, this is how a SerDe works:
- HDFS files –> InputFileFormat –> <key, value> –> Deserializer –> Row object
- Row object –> Serializer –> <key, value> –> OutputFileFormat –> HDFS files
The actual data is stored as the value, and the key is a pointer to the data. One principle of Hive is that it does not own the data in HDFS, which means users should be able to read or write files in HDFS directories without going through the SerDe. Files modified outside of Hive can still be exposed to a Hive table using CREATE EXTERNAL TABLE or LOAD DATA INPATH.
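As a sketch (paths and table name here are hypothetical), data written to an HDFS directory by other tools can be surfaced in Hive like this:

```sql
-- Point Hive at a directory it does not own; dropping the table keeps the files
CREATE EXTERNAL TABLE employee_ext (id INT, name STRING, salary INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/employee';

-- Alternatively, move an already-written HDFS file into an existing table
LOAD DATA INPATH '/staging/employee/new_hires.csv' INTO TABLE employee_ext;
```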
Hive currently uses these SerDe classes to serialize and deserialize data:
- MetadataTypedColumnsetSerDe: This SerDe reads/writes delimited records such as CSV, tab-separated, or Control-A-separated records (quoted fields are not yet supported).
- LazySimpleSerDe: This SerDe can read the same data formats as MetadataTypedColumnsetSerDe and TCTLSeparatedProtocol; however, it creates objects lazily, which provides better performance.
SerDe Properties can be added in Hive table DDLs to enhance the capabilities while performing read/write operations using a particular SerDe.
ALTER TABLE xyz SET SERDEPROPERTIES ('serialization.encoding'='GBK');
One important thing to note is that the SerDe, not the DDL, ultimately defines the table schema. A SerDe implementation may use the DDL to define the schema, but it can also override it. Consider OpenCSVSerDe: this SerDe requires all fields of the table to be of type STRING. Even if the DDL declares a field as a non-STRING type, say INT, the SerDe will internally convert the field to STRING when building the schema.
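A minimal sketch of this behavior (the table name is hypothetical): even though id is declared INT, OpenCSVSerDe surfaces it as a string.

```sql
CREATE TABLE csv_demo (id INT, name STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ('separatorChar' = ',', 'quoteChar' = '"')
STORED AS TEXTFILE;

-- DESCRIBE csv_demo reports both columns as string,
-- because the schema comes from the SerDe, not the DDL
DESCRIBE csv_demo;
```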
For native SerDes, instead of specifying the SerDe class, InputFormat, and OutputFormat explicitly, the STORED AS keyword can be used in CREATE TABLE statements. Each shorthand below is equivalent to explicitly providing the SerDe, InputFormat, and OutputFormat shown beneath it.
STORED AS AVRO/AVROFILE
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
STORED AS ORC/ORCFILE
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
STORED AS PARQUET/PARQUETFILE
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
STORED AS RCFILE
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat'
STORED AS SEQUENCEFILE
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileInputFormat' OUTPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileOutputFormat'
STORED AS TEXTFILE
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
Apache Hive Table Types
Tables in Hive are categorized into three types:
External Table
A Hive table is created as an external table if the directory containing the table's data is not managed by Hive. When an external table is dropped, the data for that table still exists in that directory on the cluster.
Syntax: CREATE EXTERNAL TABLE employee(id int, name string, salary int);
Internal Table
Hive creates internal (managed) tables by default. An internal table is one where both the metadata and the table data are managed by Hive. If the table is dropped, the associated data and directory get deleted as well.
Syntax: CREATE TABLE department(dept_id int, dept_name string);
View
A view is a logical representation of data present in a Hive table. It shows a snapshot of the data at a given point in time. A view is generally created on top of a Hive table to provide a customized view of the data from the source table.
Syntax: CREATE VIEW emp_30000 AS SELECT * FROM employee WHERE salary > 30000;
Hive Table Locations
Table location refers to where the data for a particular table is stored. Table locations can be internal or external. Note that table location is independent of table type: an external table can have an internal location and vice versa.
Internal Table Location
If Hive manages the directory storing a table's data, that path is known as an internal path. If no location is specified in the CREATE TABLE statement, the table's data directory is created in an internal location inside the Hive warehouse.
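For example (the exact warehouse path is installation-specific), the resolved path can be inspected after the table is created:

```sql
-- No LOCATION clause, so the data lands under the Hive warehouse directory,
-- typically <hive.metastore.warehouse.dir>/department
CREATE TABLE department (dept_id INT, dept_name STRING);

-- The "Location:" row of the output shows the resolved internal path
DESCRIBE FORMATTED department;
```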
External Table Location
A Hive table is said to have an external path if its directory is outside the warehouse location. An external location can be in HDFS, S3, or any other file storage.
Example: CREATE TABLE department(dept_id int, dept_name string) LOCATION '/tmp/database/department';
Grouping of Data in Apache Hive
Hive comes with multiple approaches to organize data so that data can be processed and accessed faster.
A Hive table can have one or more partitions. A partition is a physical grouping of data based on a key; it can be thought of as a table field that speeds up data lookup for a particular key. Consider an example where you store registration details in Hive. Users can register in your application from around the world. Over time the table grows, and fetching details for a particular country hurts performance. To improve performance, we can partition the table by country: a new directory is created for each country partition, and all registrations for that country are stored in it. When fetching data, the user specifies the partition value for the country of interest; Hive then reads only that directory, so data access is faster.
Create Table with Partition
CREATE TABLE REGISTRATION_DATA (userid BIGINT, First_Name STRING, Last_Name STRING, address1 STRING, address2 STRING, city STRING, zip_code STRING, state STRING) PARTITIONED BY (COUNTRY STRING);
Load data into a partition
LOAD DATA INPATH '/path/to/HDFS/dir/file.csv' OVERWRITE INTO TABLE REGISTRATION_DATA PARTITION (COUNTRY='USA');
Fetch data from a partition
SELECT * FROM REGISTRATION_DATA WHERE COUNTRY='USA';
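Partitions can also be added or listed explicitly; each one maps to a COUNTRY=&lt;value&gt; subdirectory (the country value below is illustrative):

```sql
-- Register an empty partition ahead of loading data into it
ALTER TABLE REGISTRATION_DATA ADD PARTITION (COUNTRY='India');

-- List all partitions currently known to the metastore
SHOW PARTITIONS REGISTRATION_DATA;
```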
NOTE: Partitioning works a bit differently for a Hive VIEW. A partition in a view is a logical grouping of data. Unlike tables, where partition columns are separate from the table's columns, a view's partitions must be fields of the view itself. The rightmost columns of the view, in order, can participate as partitions. A Hive view with N fields can therefore have at most N-1 partitions; at least the first column of the view query must be a non-partitioning field.
CREATE VIEW REGISTERED_USERS (First_Name, Last_Name, city, state) PARTITIONED ON (city, state) AS SELECT First_Name, Last_Name, city, state FROM REGISTRATION_DATA;
Bucketing is a logical grouping of data based on a hashing algorithm: rows whose bucketing column has the same hash code are stored in the same bucket. Note that bucketing can be done without partitioning as well.
Data in each partition may be divided into buckets based on the hash of one of the table's columns. For example, the REGISTRATION_DATA table may be bucketed by state, which is one of its columns. Bucketing can also be used to sample the data efficiently.
The first step in using bucketing is to identify the number of buckets required for your data set; this depends on data volume, but a small, fixed number (say, 10) is a common starting point. Next, decide which column is best suited for bucketing, which depends on your use case. In our example, we may want reports for users from a particular state of a particular country. One option would be to create a sub-partition within the existing country partition, but the number of states in any country is finite and does not grow over time, so in such a case we can instead bucket the partitioned data by state name. With bucketing, states with the same hash code are stored together; when we query for a particular state, Hive knows which bucket to read, further improving performance.
CREATE TABLE REGISTRATION_DATA (userid BIGINT, First_Name STRING, Last_Name STRING, address1 STRING, address2 STRING, city STRING, zip_code STRING, state STRING) PARTITIONED BY (COUNTRY STRING) CLUSTERED BY (state) INTO 5 BUCKETS;
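Once the table is bucketed, the buckets can be used for efficient sampling; a sketch using Hive's TABLESAMPLE clause (the bucket number and country value are illustrative):

```sql
-- Read roughly one fifth of the partition: the rows whose
-- hash(state) falls into the first of the five buckets
SELECT First_Name, Last_Name, state
FROM REGISTRATION_DATA TABLESAMPLE (BUCKET 1 OUT OF 5 ON state)
WHERE COUNTRY = 'USA';
```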
Arena Makes It Easy
Arena provides a 360-degree view of your data lake. One of the key components of a data lake is Hive, which is responsible for cataloging your big datasets. The Catalog feature of Arena comes packed with all the capabilities discussed in this article and many more. With an intuitive UX and RESTful APIs, you can easily create all kinds of Hive tables, modify and group data using partitions and buckets, sort data, get an instant preview of your data lake, run workflows, and perform ETL operations, all from the application itself without having to write complex queries.