Technical: Hadoop – Sqoop on Cloudera (CDH) – Is Sqoop Setup and configured for MS SQL Server?
As part of my roadmap towards Hadoop understanding, I am looking to how to use Sqoop.
What is Sqoop?
The name Sqoop is an acronym for “SQL to Hadoop“.
Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
Document Current Configuration
Launch web browser and connect to your services hub:
So here are the services:
No sqoop. what to to?
The items listed above are all services. But, Sqoop is not a service – Services are always running. Sqoop only needs to be activated when you need to import\export data.
To figure out how SQLOOP is packaged for our Hadoop version we consulted with:
Is Sqoop installed?
Is Sqoop installed? The truth is I do not know.
So let us try the brute force approach.
Issue a command to check the Sqoop Version#
Syntax: sqoop version Output: Sqoop 1.4.2-cdh4.2.0 git commit id 8885ea41b504f0b8f16f56b43b083df592b192da Compiled by jenkins on Fri Feb 15 12:07:28 PST 2013
What does all this mean?
- The version# is Sqoop 1.4.2-cdh4.2.0
- The compile date is 2013-02-15
So from everything, it looks like we are good. The most recent sqoop for our distribution is installed and runnable.
Identify – Database Drivers
Depending on which database we want to communicate with, we have to install specific Database Drivers.
As Hadoop is mostly Java based, JDBC Drivers in this case.
Identify – Database Drivers – MS SQL Server
Microsoft SQL Server JDBC Drivers are quite backward compatible; that is irrespective of the version of MS SQL Server you want to communicate with, you can usually get and install the latest drivers.
The latest Microsoft SQL Server is Microsoft JDBC Driver 4.0 for SQL Server.
Here is a screen shot:
Nice. One for Microsoft Windows (sqljdbc_4.0.2206.100_enu.exe), and another for everyone else (sqljdbc_4.0.2206.100_enu.tar.gz).
This one time we are the everyone else.
Download via web browser or mget, etc.
Copy Database Drivers to Hadoop Node
Copy database driver to working Hadoop Node
Syntax: --create new folder on remote host ssh -l <user> <host-name> "mkdir <remote-folder-name>" --scp file to remote host; into remote folder scp local-file-name hadoop-node:<remote-folder-name> Sample: --make new folder /tmp/sqljdbc4 ssh -l dadeniji hadoop-node "mkdir -p /tmp/sqljdbc4" --copy downloaded file to hadoop-node scp /Users/daniel_adeniji/Downloads/sqljdbc_4.0.2206.100_enu.tar.gz \ dadeniji@hadoop-node:/tmp/sqljdbc4
Extract compressed file (tar.gz):
extract compressed file.
Syntax: --make tar folder, the new current folder cd <dest-folder-name> --untar file, and place contents in current folder tar xvfz <file-name> Sample: --cd folder cd /tmp/sqljdbc4 --copy downloaded file to hadoop-node tar xvfz sqljdbc_4.0.2206.100_enu.tar.gz
Review extracted files:
review the extracted files.
Syntax: --review extracted folders/files ls -la sqljdbc_4.0/enu Sample: ls -la sqljdbc_4.0/enu
Here is what our folder and file structure looks like for sqljdbc_4.0/enu:
Which file shall we choose:
As we have two files (sqljdbc4.jar and sqljdbc.jar), which one shall we choose.
Well it depends on which JRE is installed?
So what does all this mean?
- Well JDBC is a specification. MS SQL JDBC4 supports both JDBC 3.0 and JDBC 4.0 specifications
- Our choice of which file (sqljdbc.jar or sqljdbc4.jar) rests upon which JRE we are using
Let us go find out which JRE we are using?
Syntax: java -version
Well, easy enough, our version# is 1.70_17-b02
But, keep me honest here, let us draw this out.
Microsoft’s compatibility matrix refers specifically to JRE v5.0, v6.0, v7.0
And, so to be sure I want to know how version#1.7xx maps to JRE v 5, 6, or 7.
The best answer came via:
Java™ Platform Standard Edition 7 Names and Versions
The web page says to take the number(s) following the first period (.) and take that has the Product Number (#); ie the marketing#
So we are running JRE Version 7 and thus we use either sqljdbc.jar or sqljdbc4.jar
We will go with sqljdbc4.0 and revert to sqljdbc if we encounter errors.
Identify Hadoop\sqoop Connection Points
Now that we have the jar file, we need to know where to place it.
SQOOP User Guide
You can use Sqoop with any other JDBC-compliant database. First, download the appropriate JDBC driver for the type of database you want to import, and install the .jar file in the
$SQOOP_HOME/libdirectory on your client machine. (This will be
/usr/lib/sqoop/libif you installed from an RPM or Debian package.) Each driver
.jarfile also has a specific driver class which defines the entry-point to the driver. For example, MySQL’s Connector/J library has a driver class of
com.mysql.jdbc.Driver. Refer to your database vendor-specific documentation to determine the main driver class. This class must be provided as an argument to Sqoop with
Where is $SQOOP_HOME/lib?
All this talk about Carmen Santiago is making me dizzy? The toughest thing with me about Linux is there are so many ways to do the same thing…
If it is not symbolic links, it is alternatives
Let us go look for sqoop
Syntax: which <find-name> Sample: which sqoop
Syntax: ls -la <find-name> Sample: ls -la /usr/bin/sqoop
/usr/bin/sqoop is /etc/alternatives/sqoop
get Listing for /etc/alternatives/sqoop
Syntax: ls -la <alternative-name> Sample: ls -la /etc/alternatives/sqoop
—- is —
I already know that I probably should not bother to look for the lib folder ( or jar files) in a bin folder. But, I am in learning mode, and so I issue:
ls -la /opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/sqoop/bin
So there you go, no lib, no jar files
Google for Sqoop Libraries
Googled for help and found “where-is-the-sqoop-library-directory” ( http://stackoverflow.com/questions/15502220/where-is-the-sqoop-library-directory )
Mad Echet answered his own question:
Confirm Sqoop Library Folder
Get folder listing for /opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/sqoop/lib
ls -la /opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/sqoop/lib
Listing for /opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/sqoop/lib folder listing
Configure Hadoop\sqoop Connection Points
Configure Hadoop\sqloop Connection Points – Microsoft SQL Server
Copy Microsoft SQL Server Files
The Microsoft SQL Server files are primarily Jar files. In our case, a lone Jar file; named sqljdbc4.jar
Let us go copy it:
syntax: sudo cp <sql-jdbc-extracted-file> <sqoop-lib sudo cp /tmp/sqljdbc4/sqljdbc_4.0/enu/sqljdbc4.jar \ /opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/sqoop/lib
Nomenclature – Microsoft SQL Server
Database / Driver Specific
- Specifies specific database name (hrdb)
( be sure to enclose in double-quotes)
Processing – Microsoft SQL Server
Command – List Databases
Syntax sqoop list-databases --connect jdbc:sqlserver://sqlserver-name / --username <username> / --password <password> / --driver <driver-manager-class> Sample sqoop list-databases --connect jdbc:sqlserver://labDB / --username sqoop / --password simp1e / --driver com.microsoft.sqlserver.jdbc.SQLServerDriver
WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead. WARN sqoop.ConnFactory: Parameter --driver is set to an explicit driver however appropriate connection manager is not being set (via --connection-manager). Sqoop is going to fall back to org.apache.sqoop.manager.GenericJdbcManager. Please specify explicitly which connection manager should be used next time. INFO manager.SqlManager: Using default fetchSize of 1000 ERROR manager.SqlManager: Generic SqlManager.listDatabases() not implemented. Could not retrieve database list from server ERROR tool.ListDatabasesTool: manager.listDatabases() returned null
Command – List tables
List the tables
Syntax sqoop list-tables --connect jdbc:sqlserver://sqlserver-name / --username <username> / --password <password> / --driver <driver-manager-class> Sample sqoop list-tables --connect jdbc:sqlserver://labDB / --username sqoop / --password simp1e / --driver com.microsoft.sqlserver.jdbc.SQLServerDriver
Command – eval
Execute SQL Command
Syntax sqoop eval --connect jdbc:sqlserver://sqlserver-name / --username <username> / --password <password> / --driver <driver-manager-class> / --query <query> Sample sqoop eval --connect jdbc:sqlserver://labDB / --username sqoop / --password simp1e / --driver com.microsoft.sqlserver.jdbc.SQLServerDriver / --query / " select schema_name(uid) as schemaName, name, crdate / from dbo.sysobjects where type = 'U' "