
Provisioning to RDBMS with Spark for variable length data

Jun Ye | November 19th, 2020

At Zaloni, we are always excited to share technical content with step-by-step solutions to common data challenges. This blog addresses the extraction of processed data from a data lake into a traditional RDBMS “serving layer” using Spark, with particular attention to variable length data. By transferring data into a “serving layer,” data engineers can better serve end users and applications with high-quality data.

Let’s begin with the problem statement. After data has been processed in the data lake, a common need is to extract some of it into a “serving layer”; in this scenario, the serving layer is a traditional RDBMS. Data engineers implementing the data transfer should pay special attention to data type handling. In this blog post, we will start with some simple Spark code to transfer data from Hive to MySQL and then optimize the code for variable length data types.

Here is our sample Hive table called staff. It consists of three columns: id, name, and address. Note that while all of these are string types, each is defined with a different character length.
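For illustration, assume a table definition along these lines (the column lengths here are examples, chosen only to show three different sizes):

```scala
import org.apache.spark.sql.SparkSession

// A Spark session with Hive support, so we can read and create Hive tables.
val spark = SparkSession.builder()
  .appName("hive-to-mysql")
  .enableHiveSupport()
  .getOrCreate()

// Illustrative source table: three string columns with different lengths.
spark.sql("""
  CREATE TABLE IF NOT EXISTS staff (
    id      VARCHAR(10),
    name    VARCHAR(100),
    address VARCHAR(255)
  )
""")
```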


And here is some rudimentary code to transfer data from Hive to MySQL. It also creates the destination table (if it does not exist) in MySQL.
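A minimal version might look like the following (the JDBC URL, database name, and credentials are placeholders):

```scala
import java.util.Properties

// Placeholder MySQL connection details -- substitute your own.
val jdbcUrl = "jdbc:mysql://localhost:3306/hr"
val props = new Properties()
props.setProperty("user", "root")
props.setProperty("password", "secret")
props.setProperty("driver", "com.mysql.cj.jdbc.Driver")

// Read the Hive table and write it out over JDBC. In append mode, Spark
// creates the destination table "employee" if it does not already exist.
val staffDf = spark.table("staff")
staffDf.write
  .mode("append")
  .jdbc(jdbcUrl, "employee", props)
```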

Tip: Remember to include the mysql-connector JAR on the classpath when running this code.

This creates and populates the employee table in MySQL. 

However, note that all the columns are created with the data type text. This is a reasonable general-purpose default, but since the source table defines these columns with tighter types, let’s see if we can do better than text in the destination.

Sidebar: Here is a quick recap of the differences between text and varchar in MySQL, from this Stack Overflow thread.

Reasons to use TEXT:

  • If you want to store a paragraph or more of text
  • If you don’t need to index the column
  • If you have reached the row size limit for your table

Reasons to use VARCHAR:

  • If you want to store a few words or a sentence
  • If you want to index the (entire) column
  • If you want to use the column with foreign-key constraints

The Spark JdbcDialect can be used to override this default behavior and map the Java String type to a custom JDBC type. The code below uses varchar(255) as the mapped type so that the largest column in the source table can be accommodated.
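A sketch of such a dialect might look like this (matching any jdbc:mysql URL is an assumption for the example):

```scala
import java.sql.Types

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types.{DataType, StringType}

// Maps every Spark StringType column to VARCHAR(255) instead of TEXT.
class MyJdbcDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:mysql")

  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType => Some(JdbcType("VARCHAR(255)", Types.VARCHAR))
    case _          => None // fall back to the default mapping
  }
}

// Register the dialect, then recreate the destination table.
JdbcDialects.registerDialect(new MyJdbcDialect)
staffDf.write
  .mode("overwrite") // drop and recreate so the new column types take effect
  .jdbc(jdbcUrl, "employee", props)
```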

Let’s look at the destination table again; this time the column types are varchar instead of text.

This is an improvement over the first example, but all the varchar columns now have the same size, which is not optimal. For wide tables, sizing every column to match the largest one may not even be viable because of the row size limit.

Let’s enhance the MyJdbcDialect class so that we can customize the size per column. The getJDBCType(…) method should return a JdbcType with a custom size for each column, but its only input argument is the DataType, which is not sufficient to determine either the column name or the column size. The code presented below works around this limitation by saving the column name in the quoteIdentifier(…) method and then using this saved column name in the getJDBCType(…) method as a lookup key to identify the exact data type for that column.
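Here is a sketch of the enhanced dialect; the entries in the jdbcTypes map (and their sizes) are illustrative and should match your source table:

```scala
import java.sql.Types

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types.{DataType, StringType}

class MyJdbcDialect extends JdbcDialect {

  // Per-column target types; hard-coded here to keep the example small.
  private val jdbcTypes = Map(
    "id"      -> "VARCHAR(10)",
    "name"    -> "VARCHAR(100)",
    "address" -> "VARCHAR(255)"
  )

  // Spark quotes each column name just before it asks for that column's
  // JDBC type, so remember the most recently quoted name here.
  private var currentColumn: String = _

  override def canHandle(url: String): Boolean = url.startsWith("jdbc:mysql")

  override def quoteIdentifier(colName: String): String = {
    currentColumn = colName
    s"`$colName`"
  }

  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType =>
      // Look up a per-column size, falling back to a generic VARCHAR(255).
      Some(JdbcType(jdbcTypes.getOrElse(currentColumn, "VARCHAR(255)"), Types.VARCHAR))
    case _ => None
  }
}

JdbcDialects.registerDialect(new MyJdbcDialect)
staffDf.write
  .mode("overwrite")
  .jdbc(jdbcUrl, "employee", props)
```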

You can verify the table in MySQL (for example, with DESCRIBE employee); the columns should now carry the per-column varchar sizes from the jdbcTypes map.

In the sample code above, the jdbcTypes map is hard-coded in order to keep the example small. In a real implementation, the map would be externalized and initialized from the source table definition so that the JdbcDialect subclass could be used for provisioning out any Hive table.
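One way to build that map is to read the column types from the source table itself. The sketch below assumes DESCRIBE reports the original Hive varchar/char types rather than plain string, which depends on the Spark and Hive versions in use:

```scala
// Build the column -> type map from the source table definition.
// Assumes DESCRIBE returns Hive types such as "varchar(100)"; verify this
// on your Spark/Hive versions before relying on it.
val jdbcTypes: Map[String, String] = spark.sql("DESCRIBE staff")
  .collect()
  .map(row => row.getString(0) -> row.getString(1).toUpperCase)
  .filter { case (_, dataType) =>
    dataType.startsWith("VARCHAR") || dataType.startsWith("CHAR")
  }
  .toMap
```

The dialect could then accept this map as a constructor argument instead of hard-coding it.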

In this article, we presented a solution for transferring data from Hive to an RDBMS such that the Spark-generated schema of the target table preserves the variable length column types of the source table. This workaround for the limitation of the Spark API is based on knowledge of the current implementation of the Spark JdbcDialect.

Want more data content? Watch the webinar, Data Governance Framework for DataOps Success, available on-demand.


about the author

Jun Ye is a Senior Module Lead on the product development team at Zaloni. He has been with Zaloni since January 2014 and plays a key role in developing Zaloni's software products and solutions. His technical expertise includes Java technologies, Spring, Apache Hive, Hadoop, Spark, AWS services, and Relational Databases.