How to Read Data from Hive & Write to MS SQL Table Using Spark-Shell

Amar Gurung, January 13th, 2022

Blog Overview:

This blog walks readers through the prerequisites and step-by-step instructions for reading data from a Hive table and writing it to a Microsoft SQL Server table using spark-shell.

Pre-requisites for Use Case:

  1. Scala
  2. spark-shell
  3. Beeline
  4. Hive 3.1.0
  5. Spark 2.3.2
  6. mssql-cli v1.0.0 (refer to https://github.com/dbcli/mssql-cli/blob/master/doc/installation/linux.md )
  7. JDBC driver sqljdbc4-2.0.jar

Use Case Steps:

  1. Create Hive Table using Beeline
create database organization_db;

use organization_db;

create table employee(id int, name string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
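If you are not already in a Beeline session, a typical way to connect is shown below; the HiveServer2 host and default port 10000 are assumptions here, so substitute the values for your cluster:

[~]$ beeline -u "jdbc:hive2://<hiveserver2-host>:10000" -n <username>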
  2. Create a CSV file & copy it to the HDFS path
[~]$ cat emp.csv
1,rohan
2,amar
3,jeet

[~]$ hadoop fs -put emp.csv hdfs://hdp314/warehouse/tablespace/managed/hive/organization_db.db/employee

[~]$ hadoop fs -cat hdfs://hdp314/warehouse/tablespace/managed/hive/organization_db.db/employee/emp.csv
1,rohan
2,amar
3,jeet
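Before moving on, a quick sanity check from Beeline confirms that Hive can read the rows that were just loaded:

select * from organization_db.employee;

This should return the three rows from emp.csv.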
  3. Create MS SQL Table
mssql-cli -U <username> -P <password> -S <sql server ip/hostname> -d <db_name>

db_name> create database orgdb;
Time: 1.005s (a second)
Commands completed successfully.

db_name> use orgdb;
Time: 0.502s
Commands completed successfully.

orgdb> create schema org;
Time: 0.453s
Commands completed successfully.

orgdb> create table org.employee(id int primary key, name varchar(1024));
Time: 0.453s
Commands completed successfully.
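At this point the target table exists but is empty; if you want to double-check it before writing from Spark, a simple query against it should come back with no rows:

orgdb> select * from org.employee;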
  4. Scala Code
val hiveDf = spark.sql("select * from organization_db.employee")
hiveDf.show()
val df = hiveDf.write.format("jdbc").mode(org.apache.spark.sql.SaveMode.Overwrite)
df.option("url","jdbc:sqlserver://<sql-server-ip>:1433;databaseName=orgdb").option("driver","com.microsoft.sqlserver.jdbc.SQLServerDriver").option("dbtable", "org.employee").option("user", "sa").option("password","xxxxx").save()
  5. Invoke spark-shell & paste the above code
spark-shell --master yarn --jars /<path-to-driver>/sqljdbc4-2.0.jar
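If spark-shell later reports java.lang.ClassNotFoundException for com.microsoft.sqlserver.jdbc.SQLServerDriver, passing the jar on the driver classpath as well usually resolves it (an assumption for client-mode YARN setups; adjust the path to your environment):

spark-shell --master yarn --jars /<path-to-driver>/sqljdbc4-2.0.jar --driver-class-path /<path-to-driver>/sqljdbc4-2.0.jar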
  6. Sample output of the spark-shell
[~]$ spark-shell --master yarn --jars <driver-path>/sqljdbc4-2.0.jar
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at https://xxxxx:4041
Spark context available as 'sc' (master = yarn, app id = application_1630517835182_0012).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.2.3.1.4.0-315
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_271)

scala> val hiveDf = spark.sql("select * from organization_db.employee")
hiveDf: org.apache.spark.sql.DataFrame = [id: int, name: string]

scala> hiveDf.show()
+---+-----+
| id| name|
+---+-----+
|  1|rohan|
|  2| amar|
|  3| jeet|
+---+-----+

scala> val df = hiveDf.write.format("jdbc").mode(org.apache.spark.sql.SaveMode.Overwrite)
df: org.apache.spark.sql.DataFrameWriter[org.apache.spark.sql.Row] = org.apache.spark.sql.DataFrameWriter@21838ea3

scala> df.option("url","jdbc:sqlserver://xxxxx:1433;databaseName=orgdb").option("driver","com.microsoft.sqlserver.jdbc.SQLServerDriver").option("dbtable", "org.employee").option("user", "sa").option("password","xxxx").save()
  7. Validate the data in MS SQL Table
[amar@crrt-edge ~]$ mssql-cli -U <user_name> -P <password> -S <sql server ip/hostname> -d orgdb

orgdb> select * from org."employee";
Time: 4.674s (4 seconds)
+------+--------+
| id   | name   |
|------+--------|
| 3    | jeet   |
| 1    | rohan  |
| 2    | amar   |
+------+--------+
(3 rows affected)
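Alternatively, the round trip can be verified from the same spark-shell session by reading the table back over JDBC (a sketch; the connection values are placeholders to replace with your own):

val sqlDf = spark.read.format("jdbc").option("url","jdbc:sqlserver://<sql-server-ip>:1433;databaseName=orgdb").option("driver","com.microsoft.sqlserver.jdbc.SQLServerDriver").option("dbtable", "org.employee").option("user", "<username>").option("password","<password>").load()
sqlDf.show()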



About the Author

Amar is a Senior Module Lead on Zaloni's engineering team.