The concept of partitioning in Hive is very similar to what we have in RDBMS. A table can be partitioned by one or more keys. This will determine how the data will be stored in the table. For example, if a table has two columns, id, name and age; and is partitioned by age, all the rows having same age will be stored together.
Need an expert solution, quickly? Zaloni’s FastTrack Package provides the data management software and expertise to solve your most pressing data challenge . . . in just 6 weeks!
Let us consider a non-partitioned managed table as follows:
create table salesdata_source
(salesperson_id int,
product_id int,
date_of_sale string)
salesperson_id | product_id | date_of_sale |
---|---|---|
12 | 101 | 10-27-2017 |
10 | 10010 | 10-27-2017 |
111 | 2010 | 10-27-2017 |
13 | 222 | 10-28-2017 |
15 | 235 | 10-28-2017 |
We have another table with the same schema, but partitioned on the column ‘date_of_sale.’
create table salesdata
(salesperson_id int,
product_id int)
partitioned by (date_of_sale string)
—–Please note that the partition column need not be mentioned in the table schema separately.
Now, let us insert data from salesdata_source to salesdata. There are two ways of inserting the data:
We specify that value of the partition while inserting the data.
insert into table salesdata partition (date_of_sale=’10-27-2017’)
select * from salesdata_source where date_of_sale=’10-27-2017’;
insert into table salesdata partition (date_of_sale=’10-28-2017’)
select * from salesdata_source where date_of_sale=’10-28-2017’;
This will create two partitions in salesdata and will insert all the data from salesdata_source to salesdata. We can verify that by the following command:
show partitions salesdata;
O/p:
date_of_sale=’10-27-2017’
date_of_sale=’10-28-2017’
The table will have the same data as the source table:
salesperson_id | product_id | date_of_sale |
---|---|---|
12 | 101 | 10-27-2017 |
10 | 10010 | 10-27-2017 |
111 | 2010 | 10-27-2017 |
13 | 222 | 10-28-2017 |
15 | 235 | 10-28-2017 |
But internally, the data for each partition will be stored as separate files under separate subdirectories. We can see the physical storage by going to the HDFS location of the table.
hadoop fs -ls /apps/hive/warehouse/db_name.db/salesdata/
O/p:
/apps/hive/warehouse/db_name.db/salesdata/date_of_sale=10-27-2017
/apps/hive/warehouse/db_name.db/salesdata/date_of_sale=10-28-2017
So as you can see, there are sub-directories created for each partition, and the data files for each partition are stored under that.
There is another way of partitioning where we let the Hive engine dynamically determine the partitions based on the values of the partition column.
Before using this, we have to set a property that allows dynamic partition:
set hive.exec.dynamic.partition.mode=nonstrict;
(This is because Dynamic Partitioning is disabled in Hive to prevent accidental creation of huge number of partitions)
hive> insert into table salesdata partition (date_of_sale)
select salesperson_id,product_id,date_of_sale from salesdata_source ;
— Please note that the partitioned column should be the last column in the select clause.
In this method, Hive engine will determine the different unique values that the partition columns holds(i.e date_of_sale), and creates partitions for each value.
We can see that with the following command:
hive> show partitions salesdata;
date_of_sale=’10-27-2017’
date_of_sale=’10-28-2017’
The maximum number of partitions that can be created by default is 200. We can increase this number by using the following queries:
set hive.exec.max.dynamic.partitions=1000;
set hive.exec.max.dynamic.partitions.pernode=1000;
Let us fire two queries on the above tables:
select * from salesdata_source where salesperson_id=12 and date_of_sale =’10-28-2017’
The above query will scan the entire table in search of this particular salesperson_id and date_of_sale. But, if the same query is fired for salesdata (which is partitioned),
select * from salesdata where salesperson_id=12 and date_of_sale =’10-28-2017’
only the partition date_of_sale =’10-28-2017’ will be scanned thus reducing the query execution time by a large margin. The table we have used here is very small, the difference in execution time will be better understood in case of huge datasets.
So we have seen how to insert data into partitioned tables from other tables. Let’s see how to handle data that is already present in HDFS.
Consider a file /user/hive/text1.dat which has the following data:
salesperson_id|product_id|date_of_sale
12|101|10-27-2017
10|10010|10-27-2017
111|2010|10-27-2017
We see that this file has data corresponding to date_of_sale=’10-27-2017’. A partitioned external table can be used to populate this data.
create external table salesdata_ext
(salesperson_id int,
product_id int)
partitioned by (date_of_sale string)
location ‘/user/hive/salesdata_ext/’
Now, we will create a subdirectory under this location manually, and move the file here.
hadoop fs -mkdir /user/hive/salesdata_ext/date_of_sale=10-27-2017
hadoop fs -cp /user/hive/text1.dat /user/hive/salesdata_ext/date_of_sale=10-27-2017/
So the file has been placed into the partition folder. Now we can see if the partition information is reflected in the table data:
Show partitions salesdata_ext;
–0
For the partition to reflect in the table metadata, we will either have to repair the table or add partition by using the alter command that we are discussing later.
msck repair table salesdata_ext;
show partitions salesdata_ext;
O/p:
date_of_sale=10-27-2017
OR
alter table salesdata_ext add partition (date_of_sale=’10-27-2017’);
show partitions salesdata_ext;
O/p:
date_of_sale=10-27-2017
Note: Using external partitioned table is a design choice. The file from HDFS can be loaded into a managed non-partitioned table as well, and from that, into a partitioned table as discussed earlier.
Suppose we have created partitions for a table, but we need to rename a particular partition or drop a partition that got incorrectly created. We can use the ‘alter’ command in such cases:
1. alter table salesdata partition (date_of_sale=10-27-2017) rename to partition (date_of_sale=10-27-2018);
Partition name will be changed(can be verified by show partitions), and the subdirectory name in the warehouse will be changed, but the data underneath will remain same.
2. alter table salesdata drop partition (date_of_sale=10-27-2017) ; (internal table)
Partition will be dropped and the subdirectory will be deleted.
3. alter table salesdata_ext drop partition (date_of_sale=10-27-2017) ; (external table)
Partition will be dropped but the subdirectory will not be deleted since this is an external table.
Conversely, if we delete the subdirectory but do not drop the partition using alter command, the partitions will remain in both external and managed tables, until we don’t execute the alter table drop partition command for the deleted partition.
News By: Team Zaloni
Blogs By: Matthew Caspento
Blogs By: Haley Teeples