Sqoop is a very effective tool for transferring huge amounts of bulk data between relational databases and Hadoop, but it has a few rough edges. Here are some hacks we've found useful for working around the issues we've seen with Sqoop.
We had to import a number of tables from an Oracle database into Hive. To handle delimiter characters embedded in the data, we used the --hive-drop-import-delims or --hive-delims-replacement options in our Sqoop commands.
These options worked well until we encountered the CLOB, ROWID, BLOB, and BFILE data types. The CLOB columns contained line feed, carriage return, and Ctrl-A characters as part of the data.
Because --hive-drop-import-delims and --hive-delims-replacement did not clean these characters out of the Oracle data, Sqoop treated the embedded line feeds and carriage returns as record terminators, which broke our record-count and schema-integrity validations. The same delimiter-handling options had worked against other database sources, but not against Oracle.
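For reference, this is roughly the shape of the Hive import we tried first. It is a sketch with placeholder table and option values, not our exact production command:

sqoop import $HADOOP_OPTS \
  --options-file $OPTIONS_FILE \
  --table EMPLOYEE \
  --hive-import \
  --hive-table EMPLOYEE \
  --hive-drop-import-delims \
  --split-by $SPLIT_BY_COLUMN \
  --num-mappers $NUM_MAPPERS

# Alternatively, substitute rather than drop the delimiter characters:
#   --hive-delims-replacement ' '

Either option is supposed to strip or replace embedded \n, \r, and \001 characters before the rows reach Hive; for our Oracle CLOB columns, that did not happen.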
We saw similar issues with the ROWID data type; again, a straightforward import failed.
Sqoop's import tool supports both table-based and query-based imports into HDFS; in this scenario, we used a query-based import to bring the data in.
To bring in the CLOB and ROWID columns, we applied additional character operations to those columns inside the query itself.
In the query passed to Sqoop, every column of the table was still selected, but we listed the columns explicitly (select col1, col2, ...) rather than using select *. This let us single out the CLOB and ROWID columns and apply the necessary character-replacement operations to them.
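For reference, the two import styles differ roughly as follows. This is a sketch with placeholder values; note that a free-form --query import requires --target-dir, the \$CONDITIONS token, and either --split-by or a single mapper:

# Table-based import: Sqoop builds the SELECT itself
sqoop import --options-file $OPTIONS_FILE \
  --table EMPLOYEE \
  --target-dir $TARGET_DIR/EMPLOYEE

# Query-based import: we control the SELECT, so columns can be transformed in place
sqoop import --options-file $OPTIONS_FILE \
  --query "SELECT Name, Position FROM $SCHEMA.EMPLOYEE WHERE \$CONDITIONS" \
  --split-by $SPLIT_BY_COLUMN \
  --target-dir $TARGET_DIR/EMPLOYEE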
e.g.
Let EMPLOYEE(Name String, Position String, Current_Project CLOB, RowIdentifier ROWID) be the table in question.
The command we executed to sqoop the EMPLOYEE table from Oracle to Hive is shown below:
sqoop import $HADOOP_OPTS \
  --options-file $OPTIONS_FILE \
  --query "${SQL} where \$CONDITIONS" \
  --mapreduce-job-name "${TBL}_LOAD" \
  --target-dir $TARGET_DIR/$TABLE_UPPER \
  --fields-terminated-by '\001' \
  --null-string "" \
  --null-non-string "" \
  --escaped-by "\\" \
  --split-by $SPLIT_BY_COLUMN \
  --num-mappers $NUM_MAPPERS \
  --outdir $SCRIPT_PATH/javaClasses
Where
SQL="SELECT Name, Position, REPLACE(REPLACE(REPLACE(Current_Project, CHR(10), ' _ '), CHR(13), ' _ '), CHR(01), ' _ ') AS Current_Project, REPLACE(REPLACE(REPLACE(RowIdentifier, CHR(10), ' _ '), CHR(13), ' _ '), CHR(01), ' _ ') AS RowIdentifier FROM $SCHEMA.EMPLOYEE"
CHR(10): line feed
CHR(13): carriage return
CHR(01): Ctrl-A, which is also the field delimiter we import with (\001)
Each "problematic" character in these columns was replaced, one by one, with a character of our choice, in this case an underscore ( _ ).
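Once the replacements are in place, the record-count validation that originally failed can be re-checked on both sides. A minimal sketch, assuming a Hive table named EMPLOYEE has been defined over the import directory:

# Row count on the Oracle side, evaluated through the same connection Sqoop uses
sqoop eval --options-file $OPTIONS_FILE \
  --query "SELECT COUNT(*) FROM $SCHEMA.EMPLOYEE"

# Row count on the Hive side; the two numbers should now match
hive -e "SELECT COUNT(*) FROM EMPLOYEE"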
Columns of the BLOB and BFILE data types cannot be imported into HDFS with Sqoop without losing or altering the data. BLOB columns can be imported, but the content arrives in hexadecimal form; BFILE columns are not supported by the Sqoop import tool at all.
As before, we used the Sqoop import tool's query-based import rather than a table-based import to bring the data into HDFS.
For now, we simply skip any columns of type BLOB or BFILE.
In the query passed to Sqoop, we again listed the columns explicitly (select col1, col2, ...) instead of using select *, which let us leave the BLOB and BFILE columns out. BLOB columns should still be recoverable by adding a separate data-loading step after the Sqoop import completes; one possible shape for that step is sketched after the example below.
e.g.
Let EMPLOYEE(Name String, Position String, Current_Project BLOB, Project_Documents BFILE) be the table in question.
The command we executed to sqoop the EMPLOYEE table from Oracle to Hive is shown below:
sqoop import $HADOOP_OPTS \
  --options-file $OPTIONS_FILE \
  --query "${SQL} where \$CONDITIONS" \
  --mapreduce-job-name "${TBL}_LOAD" \
  --target-dir $TARGET_DIR/$TABLE_UPPER \
  --fields-terminated-by '\001' \
  --null-string "" \
  --null-non-string "" \
  --escaped-by "\\" \
  --split-by $SPLIT_BY_COLUMN \
  --num-mappers $NUM_MAPPERS \
  --outdir $SCRIPT_PATH/javaClasses
Where
SQL="SELECT Name, Position FROM $SCHEMA.EMPLOYEE"
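For the BLOB column we skipped, one possible shape for the follow-up data-loading step mentioned earlier is a separate query-based import of just a key column and the BLOB, so its hexadecimal payload lands in its own directory and can be post-processed without touching the main table. This is only a sketch under those assumptions, using Name as the key:

sqoop import $HADOOP_OPTS \
  --options-file $OPTIONS_FILE \
  --query "SELECT Name, Current_Project FROM $SCHEMA.EMPLOYEE WHERE \$CONDITIONS" \
  --target-dir $TARGET_DIR/${TABLE_UPPER}_BLOB \
  --fields-terminated-by '\001' \
  --num-mappers 1 \
  --outdir $SCRIPT_PATH/javaClasses

With a single mapper, --split-by can be omitted even for a --query import. A Hive table defined over that directory could then expose the hexadecimal content directly, or decode it (for example with Hive's unhex()) if downstream consumers need the original bytes.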
We use --split-by with more than one mapper wherever possible to speed up the import process. However, we have found certain data types that cannot be used as the split-by column.
So far, the types we have identified as unusable for split-by are: text, varchar, string, nvarchar, and uniqueidentifier.
Our solution is to fall back to a single mapper in those cases. Because our workflows use a generic format for every ingestion, we wrote some custom code that adjusts the split-by column and mapper count for each ingestion before Sqoop is triggered.
So if the supplied split-by column is of one of the identified data types, the mapper count for that table's ingestion is forced to one (1).
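A minimal sketch of that pre-Sqoop check follows. The $SPLIT_BY_COLUMN_TYPE variable is hypothetical; in practice the column's type would come from whatever metadata drives the ingestion:

# Fall back to a single mapper when the split-by column's type cannot be range-partitioned
case "$(echo "$SPLIT_BY_COLUMN_TYPE" | tr '[:upper:]' '[:lower:]')" in
  text|varchar|string|nvarchar|uniqueidentifier)
    NUM_MAPPERS=1
    SPLIT_BY_ARGS=""                                # drop --split-by entirely
    ;;
  *)
    SPLIT_BY_ARGS="--split-by $SPLIT_BY_COLUMN"
    ;;
esac

# The variables are then substituted into the generic import command:
#   sqoop import ... $SPLIT_BY_ARGS --num-mappers $NUM_MAPPERS ...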
These are just a few of the Sqoop issues we've discovered elegant workarounds for to date. Subscribe to Zaloni Resources for future posts about workarounds for big data and Hadoop data lakes.