September 17th, 2020
This article is intended to explain the hive metastore management, significance of a shared metastore and how it can be quickly and securely set up using AWS Glue and Amazon EMR. For security, we’ll show you how to use Apache Ranger. Finally, we take a look at how Zaloni makes it easier to set up an EMR cluster with a Glue metastore secured by Ranger.
Before getting started with the setup let’s first understand a few of the key terminologies related to big data to give a better understanding of why this setup is required.
Apache Hive is an open source data warehousing software that facilitates reading, writing, and managing large volumes of structured data residing in distributed storage. It’s a wrapper over Apache Hadoop that enables visualizing data in a tabular format and uses SQL like queries, known as Hive Query Language (HiveQL/HQL), to analyze and get insights from the underlying data.
Hive comes with a command-line tool and a JDBC driver to connect users to hive. The command line tool can be accessed by simply typing hive in a terminal. To access hive using JDBC, beeline is the most preferred choice.
As Hive represents underlying data in tabular form, it needs a metastore service to store metadata corresponding to the data. It is implemented using tables in a relational database. Hive metastore is a relational database to manage the metadata of the persistent relational entities, e.g. databases, tables, columns, partitions in Hive.
By default, Hive uses a built-in Apache Derby SQL server. Derby is a single process storage which means only one instance of Hive CLI will be supported by Derby. This is good for personal use or small computing, but when Hive is required at a production level or in a cluster it is recommended to use relational databases like MySQL or Oracle.
To change metastore from Derby to Mysql or any other relational database you will have to change the following properties in hive-site.xml
<description>metadata is stored in a MySQL server</description>
<description>user metadata is stored in a MySQL server</description>
<description>MySQL JDBC driver class</description>
<description>username for connecting to mysql server </description>
<description>password for connecting to mysql server </description>
Note: With Hive 2.0, SQL server doesn’t work as an underlying metastore. So the best bet is to go with MySQL.
As the name suggests, if multiple clusters use a common space/database to store meta-information from their respective Hive, then it is known as a shared metastore. Well, why do we need such a thing? It turned out that with the amount of data that gets generated every second, it’s not possible to depend on a single cluster to take the complete load. To take full advantage of distributed systems generally, organizations rely on multi-cluster to distribute the compute.
At the same time organizations look for a single place of truth to look for the metadata of their data lake. This is where a shared metastore comes into the picture. A shared metastore is a single repository that stores Hive information for all connected clusters. The information can be accessed from any of the clusters. Actual data may or may not be accessible depending on the permissions/availability of the data in that particular cluster.
To configure a shared metastore you may set the following property in hive-site.xml of each cluster. The value needs to be the same in all clusters.
<description>The Thrift URI of shared Hive metastore</description>
With the emergence of cloud computing, many organizations are rapidly migrating to cloud to take advantage of the flexibility that it offers. One of the key benefits of the cloud is that it provides the capability of having ephemeral clusters that are just responsible for the compute and not for storing data.
One can scale up or down compute capabilities and once the compute is done, the cluster can be terminated. The major problem with terminating or destroying the cluster after compute is that the data computed and the metadata will be lost. Thanks to services like Amazon S3 and Glue that this information is preserved. While S3 is a resilient service to store data in the cloud, the Glue catalog helps with persisting metadata for ephemeral clusters.
Glue is a serverless, fully managed extract, transform, and load (ETL) service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores. Glue can be configured to use as a shared metastore for EMRs.
Amazon Elastic MapReduce (Amazon EMR) is an industry-leading cloud big-data processing platform from AWS that helps to compute large amounts of data using open source tools like Apache Spark, Apache Hive, Apache Hbase, etc. EMRs are best used only for processing. Once the processing is done, the EMR is generally terminated and a new one comes up when further processing is required. EMRs can be configured to scale up and down depending on load. As EMRs are ephemeral (does its job and terminates without persisting data) it is advised to store data in a persistent layer like S3. EMR comes with the provision to store metadata in the Glue catalog. This helps in persisting and reusing metadata during batch processing of a big data application.
Using Glue Data Catalog for Hive metastore management is very easy in EMR. Unlike on-prem setups where you need to change the value of a property in hive-site.xml, in EMR it is just a matter of a single click. Once you land on the EMR creation page, you will see a checkbox to Use AWS Glue Data Catalog for table metadata. Check this checkbox and you are all done.
One thing to note is that for an account there is only one Glue Data Catalog. So all EMRs within the account that will use Glue as the metastore will share the same Glue catalog, meaning the Hive databases, tables, partitions, etc. will be accessible from all EMRs that are using Glue to maintain metadata. Looks like an issue and potentially a security bridge right? Well, actually AWS has thought about it and provided two ways to resolve it.
You can attach policies to Glue catalog to restrict access to the meta information. This is good, but there is one problem with this. There is a risk that the policy may be altered, knowing or unknowingly, resulting in relaxing restrictions. Well, there is another way to secure your data and metadata- Ranger.
Apache Ranger is a centralized monitoring and data security management framework across the Hadoop platform. Ranger provides centralized administration, access control, and detailed auditing for user access across Hadoop, Hive, HBase, etc.
It is an easy and effective way to set access control policies and audit data access across the entire Hadoop stack by following industry best practices. A key benefit of Ranger is that access control policies can be managed by security administrators consistently across Hadoop environments. Ranger has a nice UI and published REST APIs that make it easy for administrators to manage security in the Hadoop cluster.
As Ranger is a third party software, the first question that may come to our mind is where to set up ranger for monitoring and access control. Clearly it won’t be along with Glue as it is serverless and there won’t be any instance for Glue. Can we set up Ranger in EMR? Sure we can. But it won’t be beneficial as once the EMR is terminated the policies will be lost and next time a new EMR is created, the user has to set up the policies again. Also if we set the policies in EMR, policies have to be set on each and every EMR. This is a kind of duplication and is error-prone.
A better solution would be to abstract ranger outside EMR so that users can create policies once and that gets associated with any new EMR that is launched for processing.
To achieve this follow the following steps:
Now that we have covered the high-level requirements for setting up EMR with Glue catalog and securing it using Ranger, it may feel overwhelming and one may think this process may take some time to get hold of things, understand them and finally configure the environment. There is no denying that there is a bit of a learning curve for one to understand the core working mechanism, but thanks to Zaloni’s homegrown ansible playbooks that will make the job very easy for even a newbie. A fully configurable playbook has all that is required to set up your environment, secure it with Ranger, and use Glue as the Hive metastore. (Note that the playbook is configurable to even use RDS or internal database as a metastore).
What is even better is it deploys Arena services as well so that you can have a better intuitive UI to play around with the cluster, perform ingestion, profile data, do some quality checks, and more.
There are certain limitations to a shared metastore that one should keep in mind while setting up a multi-clustered environment. A few of the key limitations are:
With organizations moving rapidly to cloud computing with ephemeral clusters, a shared metastore is a great option to persist metadata and not keep the compute machines running, thereby reducing cloud computing costs. AWS provides a cost-effective solution to process large chunks of data efficiently using EMR. EMR can be mapped with Glue Data Catalog to store meta information of the Hive tables used to process data. The data and the metadata can be secured using Ranger.
Zaloni provides an end-to-end solution, Arena, to set up EMR with Glue, secure with Ranger and visualize, ingest, catalog, profile, and run data quality, along with many more features for Hive metastore management. Arena has an intuitive UI and also published REST endpoints to make the job easier.
To learn more about Zaloni’s Arena platform or to schedule a custom demo, visit www.zaloni.com/demo.