
Sqoop Hacks: 3 Ways to Get the Most Out of the Tool

Team Zaloni | June 21st, 2017


When it comes to transferring huge amounts of bulk data between relational databases and Hadoop, Sqoop is a very effective tool, but it has a few issues. Here are some hacks that we’ve found useful for working around the issues we’ve seen with Sqoop.

Sqoop Challenge #1: Importing tables from Oracle to Hive

We had to import a number of tables from an Oracle database into Hive. To do that, we were using the --hive-drop-import-delims or --hive-delims-replacement options in our Sqoop commands to handle delimiters in the data.
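
For context, here is a minimal sketch of how those options typically appear in an import command (the connection string, credentials, and table below are placeholders, not our actual environment):

sqoop import \
--connect jdbc:oracle:thin:@//db-host:1521/ORCL \
--username scott -P \
--table EMPLOYEE \
--hive-import \
--hive-delims-replacement ' ' \
--num-mappers 4

--hive-delims-replacement swaps any \n, \r, or \001 found in string fields for the given replacement string, while --hive-drop-import-delims removes those characters entirely.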

These options worked perfectly until we encountered the CLOB, ROWID, BLOB, and BFILE data types. The CLOB column contained line feed/carriage return and Ctrl-A characters as part of its data.

Because --hive-drop-import-delims and --hive-delims-replacement did not handle these characters for the Oracle source, the Sqoop process detected the embedded line feeds/carriage returns as new-record markers, which caused validation failures around record count and schema integrity. The same delimiter-handling options have worked for other database sources; against Oracle, they did not.

We saw similar issues with the ROWID data type as well; again, a straightforward import failed.

The workaround 

Sqoop's import tool supports both table-based and query-based imports into HDFS. In this scenario, we used a query-based import to bring the data in.

To bring in the CLOB/ROWID columns, we did some additional character operations on the corresponding column data.

In the query provided to the Sqoop tool, all of the table's columns were selected, but instead of select *, an explicit select col1, col2, ... was used. This allowed us to single out the CLOB and ROWID columns and apply the necessary character-replacement operations to them.

e.g.
Let EMPLOYEE(Name String, Position String, Current_Project CLOB, RowIdentifier ROWID) be the table in question.

Now, the command we executed for sqooping the EMPLOYEE table from Oracle to Hive is shown below:

sqoop import $HADOOP_OPTS \
--options-file $OPTIONS_FILE \
--query "${SQL} where \$CONDITIONS" \
--mapreduce-job-name "${TBL}_LOAD" \
--target-dir $TARGET_DIR/$TABLE_UPPER \
--fields-terminated-by '\001' \
--null-string "" \
--null-non-string "" \
--escaped-by "\\" \
--split-by $SPLIT_BY_COLUMN \
--num-mappers $NUM_MAPPERS \
--outdir $SCRIPT_PATH/javaClasses

Where

SQL="SELECT Name, Position, REPLACE(REPLACE(REPLACE(Current_Project, CHR(10), ' _ '), CHR(13), ' _ '), CHR(01), ' _ ') AS Current_Project, REPLACE(REPLACE(REPLACE(RowIdentifier, CHR(10), ' _ '), CHR(13), ' _ '), CHR(01), ' _ ') AS RowIdentifier FROM $SCHEMA.EMPLOYEE"

CHR(10): line feed
CHR(13): carriage return
CHR(01): Ctrl-A (Hive's default field delimiter)

In the affected columns, each "problematic" character was replaced, one by one, with a character of our choice, in this case the underscore ( _ ).
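
Once the import completes, a quick record-count check confirms that the embedded line feeds are no longer being read as record boundaries. A minimal sketch of that check, assuming the same $TARGET_DIR/$TABLE_UPPER layout used above; EXPECTED_COUNT is an illustrative variable supplied by the caller, not part of our actual workflow:

# Each line in the imported files should now be exactly one record,
# since the embedded CHR(10)/CHR(13)/CHR(01) characters were replaced with '_'.
ACTUAL_COUNT=$(hdfs dfs -cat "$TARGET_DIR/$TABLE_UPPER"/part-* | wc -l)
if [ "$ACTUAL_COUNT" -ne "$EXPECTED_COUNT" ]; then
  echo "Record count mismatch: expected $EXPECTED_COUNT, got $ACTUAL_COUNT" >&2
  exit 1
fi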

Sqoop Challenge #2: Handling BLOB & BFILE Columns

Columns of these two data types cannot be imported into HDFS with Sqoop without losing or altering the data. BLOB columns can be imported into HDFS, but the content arrives in hexadecimal format. BFILE columns are not supported by the Sqoop import tool at all.

The workaround 

We can use a table-based or query-based import in the Sqoop import tool to import the data into HDFS. In this workaround, we will use a query-based import to bring the data in.

As of now, we are skipping any columns with the data types BLOB or BFILE.

In the query provided to the Sqoop import tool, the columns were again listed explicitly rather than with select *. This allowed us to identify the BLOB and BFILE columns and skip them accordingly. BLOB columns should be possible to handle with an additional data-loading step after the Sqoop import completes (see the sketch after the example below).

e.g.
Let EMPLOYEE(Name String, Position String, Current_Project BLOB, Project_Documents BFILE) be the table in question.

Now, the command we executed for sqooping the EMPLOYEE table from Oracle to Hive is shown below:

sqoop import $HADOOP_OPTS \
--options-file $OPTIONS_FILE \
--query "${SQL} where \$CONDITIONS" \
--mapreduce-job-name "${TBL}_LOAD" \
--target-dir $TARGET_DIR/$TABLE_UPPER \
--fields-terminated-by '\001' \
--null-string "" \
--null-non-string "" \
--escaped-by "\\" \
--split-by $SPLIT_BY_COLUMN \
--num-mappers $NUM_MAPPERS \
--outdir $SCRIPT_PATH/javaClasses

Where

SQL="SELECT Name, Position FROM $SCHEMA.EMPLOYEE"
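
For the BLOB column we skipped above, the additional load mentioned earlier could look roughly like the following: a second, single-mapper pass that brings in a key column and the BLOB column (which Sqoop lands as hexadecimal text), to be decoded afterwards in Hive. This is only a sketch under those assumptions, not something we have productionized; the _BLOB target directory suffix is illustrative:

sqoop import $HADOOP_OPTS \
--options-file $OPTIONS_FILE \
--query "SELECT Name, Current_Project FROM $SCHEMA.EMPLOYEE where \$CONDITIONS" \
--target-dir $TARGET_DIR/${TABLE_UPPER}_BLOB \
--fields-terminated-by '\001' \
--num-mappers 1 \
--outdir $SCRIPT_PATH/javaClasses

An external Hive table over that directory plus an INSERT ... SELECT Name, unhex(Current_Project) could then recover the binary content. BFILE columns, which are only pointers to files stored outside the database, would still need to be handled outside of Sqoop.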

Sqoop Challenge #3: Using --split-by

We use --split-by with more than one mapper wherever possible to speed up the import process. However, we have found certain data types that cannot be used as split-by columns.

To date, the following types have not worked as split-by columns for us: text, varchar, string, nvarchar, uniqueidentifier.

The workaround

As a solution, we make sure to use a single mapper in these cases. Since our workflows use a generic format for all ingestions, we've written some custom code that adjusts the split-by column and mapper count for each ingestion before Sqoop is triggered.

Hence, if the split-by column provided is of any of the identified data types, we must alter the mapper count for the particular table’s ingestion to one (1).
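
A minimal sketch of that pre-Sqoop adjustment, assuming the workflow already knows the split-by column's data type (SPLIT_BY_TYPE and SPLIT_BY_OPTION are illustrative names, not from our actual code):

# Fall back to a single mapper when the split-by column's type is one we
# have found unusable for splitting.
case "$(echo "$SPLIT_BY_TYPE" | tr '[:upper:]' '[:lower:]')" in
  text|varchar|string|nvarchar|uniqueidentifier)
    NUM_MAPPERS=1
    SPLIT_BY_OPTION=""   # with one mapper, Sqoop does not need --split-by
    ;;
  *)
    SPLIT_BY_OPTION="--split-by $SPLIT_BY_COLUMN"
    ;;
esac

sqoop import $HADOOP_OPTS \
--options-file $OPTIONS_FILE \
--query "${SQL} where \$CONDITIONS" \
--target-dir $TARGET_DIR/$TABLE_UPPER \
$SPLIT_BY_OPTION \
--num-mappers $NUM_MAPPERS \
--outdir $SCRIPT_PATH/javaClasses

Keeping this check outside the Sqoop command itself lets the same generic ingestion template work for every table.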


These are only a few of the Sqoop issues we've found elegant workarounds for so far. Subscribe to Zaloni Resources for future posts about workarounds for big data and Hadoop data lakes.


about the author

This team of authors from Team Zaloni provides expertise, best practices, tips and tricks, and use cases across varied topics including data governance, data catalog, DataOps, observability, and much more.