Technical: Hadoop – Hive – Loading Data

Introduction

As a data-warehouse project, Hive does not support the traditional singleton database insert statement.

Data inserts have to be performed in bulk-record modes.  One of those avenues is the built-in load statement.
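For reference, here is the general shape of the load statement (a minimal sketch; the bracketed clauses are optional, and the sample mirrors the command used later in this post):

Syntax:
      hive -e "load data [local] inpath '<file-path>' [overwrite] into table <table-name>;"

Sample:
      hive -e "load data local inpath '/tmp/book.txt' into table bookCharacterDelimited;"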

Data Model – Logical

I looked online for sample data models that are representative of familiar data entities.  And, the one I finally settled on is a book entity.

It happens to be an XML representation:

http://msdn.microsoft.com/en-us/library/windows/desktop/ms762271(v=vs.85).aspx

I think we all remember when XML was going to take over the world, as it crosses OS and cultural boundaries due to its precision.

I am not sure how all of that argument is now being jettisoned for NoSQL.

Nevertheless, here is the Microsoft definition of a book entity:

Item Type
author String
title String
edition String
genre String
price Decimal
publishDate TimeStamp
description String
haveit bit

Data Model – Physical

Here is Hive’s implementation of our logical data model.

Item Type
author String
title String
edition String
genre String
price Decimal (>= Hive 0.11), float (< Hive 0.11)
publishDate TimeStamp (>= Hive 0.8.0), string (< Hive 0.8.0)
description String
haveit binary (>= Hive 0.8), tinyint (< Hive 0.8)

Hive – Create Table

Prepare – Create Table Statement

It is a bit easier to just use an editor to enter your DDL statements.

We used vi to create /tmp/hive/helloWorld/dbBook__CreateTable__delimited.sql.


      drop table bookCharacterDelimited;

      create table bookCharacterDelimited
      (
           author      string
         , title        string
         , edition      string
         , genre        string
         , price        decimal
         , publishDate  timestamp
         , description  string
         , haveit       binary
         , acquireDate  timestamp

    )

    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '~'
    STORED AS TEXTFILE 
    ;

    quit;

Actual – Execute Create Table Statement

Once we have a file that contains our create statement, we invoke the Hive client and indicate that we would like to submit a SQL file via -i.



   hive -i /tmp/hive/helloWorld/dbBook__CreateTable__delimited.sql

Hive – Table – Data

Create a text file.  The name of the text file is /tmp/book.txt



Mohammed Khaled~Valley of the doors~1~Fiction1~10.45~2013-01-01~Afghan~0
Jackie Collins~Accident~2~12.56~TECH~27.87~2012-01-17~Lovely~1
Uwem Akpan~Say you're one of them~Fiction
Elizabeth Berg~Open House~Fiction
Kaye Gibbons~A Virtuos Woman
Khaled Hosseini~And the mountains echoed
Valorie Schaefer~The Care and Keeping of you~1
Malcom Gladwell~Outliers
Carmen Reinhart & Kenneth Rogoff~This time is different: Eight Centuries of financial folly
Steven Pinker~The Better Angels of our Nature: Why Violence has declined
Vivien Stewart~A World-Class Education

Hive – Table – Data – Load

Invoke Hive Client (hive) to load the data:


hive -e "load data local inpath '/tmp/book.txt' into table bookCharacterDelimited;"

Hive – Table – View Data

Use hive query tool to view the data:


hive -e "select * from bookCharacterDelimited;"

Hive – Create Table – Problem Areas

Introduction

Let us try using a different delimiter (;)

Prepare – Create Table Statement

The create statements are placed in /tmp/hive/helloWorld/dbBook__CreateTable__Delimited__SemiColon.sql


      drop table bookCharacterDelimited__SemiColon;

      create table bookCharacterDelimited__SemiColon
      (
           author      string
         , title        string
         , edition      string
         , genre        string
         , price        decimal
         , publishDate  timestamp
         , description  string
         , haveit       binary
         , acquireDate  timestamp

    )

    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ';'
    STORED AS TEXTFILE 
    ;

    quit;

Actual – Execute Create Table Statement

Once we have a file that contains our create statement, we invoke the Hive client and indicate that we would like to submit a SQL file via -i.



hive -i /tmp/hive/helloWorld/dbBook__CreateTable__Delimited__SemiColon.sql

Output (Textual):

  • FAILED: ParseException line 19:22 mismatched input ‘<EOF>’ expecting StringLiteral near ‘BY’ in table row format’s field separator

Output (Pictorial):

Hive -- CreateTable -- Expecting StringLiteral

Explanation:

  • It seems that Hive is quite picky about which delimiters are allowed
  • If a delimiter is not specified, the default choice is Ctrl-A (hex 0x01); a sketch of spelling this out explicitly in the DDL follows below
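If you do want to spell the default separator out explicitly in the DDL, Ctrl-A can be written as '\001' (a minimal sketch; the table name bookCtrlA is hypothetical):

      hive -e "create table bookCtrlA (author string, title string)
               row format delimited
               fields terminated by '\001'
               stored as textfile;"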

 

Hive – Create Table – Native

  • We created a new file /tmp/hive/helloWorld/dbBook__CreateTable__delimited.sql
  • A delimiter is not specified and so we assume the default of CTRL-A

      drop table booknative;

      create table booknative
      (
           author      string
         , title        string
         , edition      string
         , genre        string
         , price        decimal
         , publishDate  timestamp
         , description  string
         , haveit       binary
         , acquireDate  timestamp

    )

    ROW FORMAT DELIMITED
    STORED AS TEXTFILE 
    ;

    quit;

Replace original tilde (~) with CTRL-A

Using sed or tr, replace the original tilde (~) with Ctrl-A (hex 0x01).

In this case we chose to use tr:


cat dbBook_Data_in.txt | tr '~' $'\x01' > dbBook_Data_in_native.txt

Explanation:

  • The original file name is dbBook_Data_in.txt
  • We replace each ~ with \x01 (Ctrl-A)
  • The resultant file name is dbBook_Data_in_native.txt
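The paragraph above also mentions sed; a sed-based equivalent (a sketch, assuming GNU sed, which understands \x01 in the replacement text) would be:

      sed 's/~/\x01/g' dbBook_Data_in.txt > dbBook_Data_in_native.txt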

Load data into Hive Table

Load the data, overwriting the table's current contents:

hive -e "load data local inpath '/tmp/hive/helloWorld/dbBook_Data_in_native.txt' overwrite into table bookNative;'

Validate data

Validate the data:


hive -e "select * from bookCharacterDelimited;"

References

References – Hive – Data Definition Language

References – Hive – Data Types

References – Control Characters

Technical: Hadoop – Hive – Review Settings (Using the set command)

Background

Fresh off another Hadoop \ Hive install, I googled a few items and found Apache's own Hive documentation.
As for the install itself, I have found Cloudera's to be a bit clearer, as I can just install via RPMs.

For Apache’s install, I need to do my own build.

BTW, Apache has a fairly comprehensive doc @ https://cwiki.apache.org/Hive/gettingstarted.html

As life is a contact sport, and IT is by nature the same, I do not mind reading through documentation and understanding some of the design thinking.

One of the areas mentioned in Apache’s documentation is the role of the mapred.job.tracker.

Unlike our traditional databases, Hive SQL queries are actually translated into MapReduce jobs and then processed.

Thus mapreduce configurations are important.

Review Hive Settings

So how do we review Hive settings?  By issuing set; of course.

Review Hive Settings – Items Overridden by the User or Hive


set;

 

Review Hive Settings – All Configuration Items (Hive and Hadoop)


set -v;

 

Review Hive Settings – Specific Items

To review values for specific items, issue set followed by the item name.

Syntax:
   hive -e 'set <item>'

Sample:
    hive -e 'set hive.metastore.uris'

Output:

Hadoop -- Hive -- set

Review Hive Settings – Wild Card Items

To review values for items matching a pattern, we end up getting all items and then grepping; the cut command restricts line length to 80 characters.

Syntax:
   hive -e 'set -v;' | cut -c 1-80 | grep <item>

Sample:
    hive -e 'set -v;' | cut -c 1-80 | grep mapred

Explanation:

  • In the example above, we are interested only in mapred items.

Output:

Hadoop -- Hive -- set -- wildcard (mapred)

References

Technical: Hadoop/Cloudera [CDH]/Hive v2 – Installation

Pre-requisites

There are quite a few pre-requisites that should be met prior to installing and running Hive.

They include:

Packages

The packages that comprise the Hive installation are:

Package Description
hive The base package that provides the complete language and runtime (required)
hive-metastore provides scripts for running the metastore as a standalone service (optional)
hive-server provides scripts for running the original HiveServer as a standalone service (optional)
hive-server2 provides scripts for running the new HiveServer2 as a standalone service (optional)

Installation – Package – Hive

 

Introduction

Let us review Hive’s base package.

Get Package Info

Use “yum info” to get a bit of information about the RPM.


Syntax:
      yum info <package-name>

Sample:
      yum info hive

Output:

Hadoop - Hive - package - hive -- info (v2)

Install Package – hive

Install Hive.


Syntax:
      sudo yum install <package-name>

Sample:
      sudo yum install hive

Output:

Hadoop - Hive - package - hive -- dependency

Dependencies

The dependencies are listed below.

Package Description
hadoop-client Hadoop – Client
hadoop-mapreduce Hadoop – MapReduce (Version 1)
hadoop-yarn Hadoop – MapReduce (Version 2)

Install Log

The Installation Log.

Hadoop - Hive - package - hive -- Install - Log

Installation – Package – Hive-Metastore

Introduction

Install Hive Metastore.

Get Package Info

Use “yum info” to get a bit of information about the RPM.


Syntax:
      yum info <package-name>

Sample:
      yum info hive-metastore

Output:

Hadoop - Hive - package - hive-metastore -- info

Install Package – Hive-Metastore

Install Hive-MetaStore.


Syntax:
      sudo yum install <package-name>

Sample:
      sudo yum install hive-metastore

Output:

Hadoop - Hive - package - hive-metastore -- Install - Log

Installation – Package – Hive-Server2

Introduction

There are two versions of the Hive server: Hive-Server and Hive-Server2.

Hive-Server2 is the latest, and as there are no major reasons to shy away from it, we will choose it over the earlier version (Hive-Server).

Get Package Info

Use “yum info” to get a bit of information about the RPM.


Syntax:
      yum info <package-name>

Sample:
      yum info hive-server2

Output:

Hadoop - Hive - package - hive-server2 -- info

Install Package – Hive-Server2

Install Hive Server 2.


Syntax:
      sudo yum install <package-name>

Sample:
      sudo yum install hive-server2

Output:

Hadoop - Hive - package - hive-server2 -- Install - Log

Configure – Hive MetaStore – Create Database & Schema

Introduction

The Database and Schema will be created in this section.

File Listing

There are a few SQL files that are bundled with Hive.  Here is a current file listing:


Syntax:
      ls <folder> 

Sample:
      ls /usr/lib/hive/scripts/metastore/upgrade/mysql

Output:

Hadoop - Hive - SQL - Folder List

The files can be broken down into sections:

  • ###-HIVE-####.mysql.sql ==> Individual changes and bug fixes, each packaged as a separate MySQL patch script
  • hive-schema-<major>-<minor>-<delta>.mysql.sql ==> The complete schema SQL for each Hive version, packaged in whole
  • upgrade-<version-old>-to-<version-new> ==> Upgrade from one version to the next

Which file to use

The files we would like to use are the hive-schema-### files, as they cover a new install.

Create SQL Modules

 

Here are the steps.

Connect to database



Syntax:
         mysql -h <machine> -u <user> -p 

Sample:
         mysql -h localhost -u root -p 

Create database



Syntax:
         create database if not exists <database>;

Sample:
         create database if not exists metastore;

Output:

MySQL - Create Database

Change database context:



Syntax:
         use <database>;

Sample:
         use metastore;

Output:

MySQL - Change Database

Create Hive MetaStore database objects:



Syntax:
         source <filename>;

Sample:
         source /usr/lib/hive/scripts/metastore/upgrade/mysql/hive-schema-0.10.0.mysql.sql;
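Equivalently, and a touch more script-friendly, the schema file can be fed to the mysql client non-interactively (a sketch; it assumes the metastore database created above and the root account):

      mysql -h localhost -u root -p metastore < /usr/lib/hive/scripts/metastore/upgrade/mysql/hive-schema-0.10.0.mysql.sql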

Create Hive User

Here are the steps.

Create Hive User



Syntax:
    create user '<user>'@'<host>' identified by '<password>';

Sample:

--there are so many variations of the create user sql statement
--here are some of them
    --create user (hive) and grant access when on DB Server
    create user 'hive'@'localhost' identified by 'sanda1e';

    --create user (hive) and grant access from specific host (rachel)
    create user 'hive'@'rachel' identified by 'sanda1e';

    --create user (hive) and grant access from hosts in specific domain
    create user 'hive'@'%.<domain-name>' identified by 'sanda1e';

    --create user and allow access from all hosts
    create user 'hive'@'%' identified by 'sanda1e'; 

Command:

Create User & Output.

Hadoop - Hive - User and Password

Example:

  • In the example above we created a new user that is usable from any host that resides in the specified domain
  • To determine a host's domain and Fully Qualified Domain Name (FQDN), connect to the host and issue the commands "hostname --domain" and "hostname --fqdn"
  • The password for the created user (hive) is sanda1e

Review User List

Here are the steps.

Query system table (mysql.user) and make sure that user was indeed created.



Syntax:
    select user, host, password from mysql.user;

Sample:

    select user, host, password from mysql.user;

Command:

List Users.

MySQL -- mysql-user (v2)

Explanation:

  • We see that the hive user is indeed created and bound to a specific domain

 

Grant permission to Hive User

Grant permission to Hive User.



Syntax:

   REVOKE ALL PRIVILEGES, GRANT OPTION FROM 'hive'@'metastorehost';

   GRANT SELECT,INSERT,UPDATE,DELETE,LOCK TABLES,EXECUTE 
       ON metastore.* TO 'hive'@'metastorehost';

   FLUSH PRIVILEGES;

Sample:

   REVOKE ALL PRIVILEGES, GRANT OPTION 
        FROM 'hive'@'%.<domain-name>';

   GRANT SELECT,INSERT,UPDATE,DELETE,LOCK TABLES,EXECUTE 
       ON metastore.* TO 'hive'@'%.<domain-name>';

   FLUSH PRIVILEGES;

Output:

Mysql - Hive - Grant permissions

Example:

  • Revoke all permissions from user
  • Grant DML permissions only

Review Permission Set

Review permission set.



Syntax:

  select User, Host, DB, Select_priv, Insert_priv, Delete_priv
             , Execute_Priv 
  from mysql.db;

Sample:

  Select User, Host, DB, Select_priv, Insert_priv, Delete_priv
             , Execute_priv 
  from   mysql.db;

 

Output:

Mysql - Hive -- Read permissions (v2)

Explanation:

  1. The record bearing the name “hive”.  This user is only valid when issued from a host in the domain specified in the Host column
  2. We have the privileges that we assigned

Validate User Access & permission set

Validate that user can connect to Mysql

Determine whether user can connect.  And, if so dig into connection ID.



Syntax:

  mysql -h <hostname> -u hive -p<password> -e <query>

Sample:

  mysql -h rachel -u hive -p"sanda1e" -e 'select current_user'

Output:

Mysql - Hive - Query -- Select curent_user

Explanation:

  • User hive is able to connect to host rachel and is being authenticated via its FQDN
  • The query run is "select current_user"

Configure – Hive MetaStore – Configure MetaStore Service

Configuration File

File Name :- hive-site.xml

File Name (Full) :- /etc/hive/conf.dist/hive-site.xml

Configuration File – Deployment Host

http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.2.0/CDH4-Installation-Guide/cdh4ig_topic_18_4.html

Though you can use the same hive-site.xml on all hosts (client, metastore, HiveServer), hive.metastore.uris is the only property that must be configured on all of them; the others are used only on the metastore host.

Configurable Elements

Configuration values – javax.jdo.option.ConnectionURL

There is a specific syntax for specifying the ConnectionURL.

Here is how it is defined for each Vendor\database:

Vendor Syntax
Apache Derby jdbc:derby:;databaseName=../build/test/junit_metastore_db;create=true
mysql jdbc:mysql://myhost/metastore
postgresql jdbc:postgresql://myhost/metastore
oracle jdbc:oracle:thin:@//myhost/xe

Configuration values – javax.jdo.option.ConnectionDriverName

Each vendor provides database-specific Java jar files; to gain connectivity to a given database, specify the corresponding driver class name (which is defined in the provided jar file).

Vendor Value
Apache Derby org.apache.derby.jdbc.EmbeddedDriver
mysql com.mysql.jdbc.Driver
postgresql org.postgresql.Driver
oracle oracle.jdbc.OracleDriver

Configuration values – javax.jdo.option.ConnectionUserName

This is the username the Hive Service will use to connect to the database.  Conventionally, it will be hive.

For security-conscious implementations, consider changing it.

Configuration values – javax.jdo.option.ConnectionPassword

This is the password for the defined username. 

Configuration values – datanucleus.autoCreateSchema

DataNucleus Community

http://www.datanucleus.org/products/accessplatform_2_2/rdbms/schema.html

DataNucleus can generate the schema for the developer as it encounters classes in the persistence process. Alternatively you can use the SchemaTool application before starting your application.

If you want to create the schema (tables+columns+constraints) during the persistence process, the property datanucleus.autoCreateSchema provides a way of telling DataNucleus to do this.

Configuration values – datanucleus.fixedDataStore

DataNucleus Community – Fixed Schema

http://www.datanucleus.org/products/accessplatform_2_2/jdo/datastore_control.html

Some datastores have a “schema” defining the structure of data stored within that datastore. In this case we don’t want to allow updates to the structure at all. You can set this when creating your PersistenceManagerFactory by setting the persistence property datanucleus.fixedDatastore to true .

Configuration values – hive.metastore.uris

This is the full URL to the metastore.

The syntax is thrift://<host>:<port>

Item Syntax Value
Protocol thrift thrift
Host host localhost
Port port 9083

Configuration values

Here are the pertinent values for us.

Item Syntax Value
javax.jdo.option.ConnectionURL jdbc:mysql://<host>/<database> jdbc:mysql://rachel/metastore
javax.jdo.option.ConnectionDriverName <jdbc-class-name> com.mysql.jdbc.Driver
javax.jdo.option.ConnectionUserName <hive-username> hive
javax.jdo.option.ConnectionPassword <hive-password> sanda1e
datanucleus.autoCreateSchema false or true false
datanucleus.fixedDatastore false or true false
hive.metastore.uris thrift://<host>:<port> thrift://localhost:9083
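Once these values are in hive-site.xml, a quick sanity check (a sketch, reusing the set command covered in the settings write-up) is to ask Hive to echo them back:

      hive -e 'set hive.metastore.uris;'
      hive -e 'set javax.jdo.option.ConnectionURL;'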

Configure – Hive -to- MetaStore – Avail *-connector-java.jar

Introduction

For Hive to be able to connect to the MetaStore, let us install and avail the connector jar files.

As our metastore is back-ended by mysql, we will install the mysql-connector-java.

Install mysql-connector-java

$ sudo yum install mysql-connector-java

Create OS File System  Soft Link

$ sudo ln -s /usr/share/java/mysql-connector-java.jar /usr/lib/hive/lib/mysql-connector-java.jar

Is Soft Link in place?

Query Hive Lib Folder and ensure that we have a connector (mySQL Connector) file symbolically linked to our original install location:

$ ls -ila /usr/lib/hive/lib/mysql-connector-java.jar

Explanation:

  • Issue ls query against soft link recipient folder

Output:

Linux - SymLinks - Hive - Lib file

HDFS – File System – Permissions

Introduction

File System

By default Hive stores the actual tables and data in HDFS under the /user/hive/warehouse folder.

Therefore, please grant all users and groups that will be using Hive access to that folder and its sub-folders.

Please keep in mind that this construct is directed at HDFS and OS File System.

Hive itself has built-in support for higher-level permission sets.  We will use Hive tooling a bit later.

It is also important to grant HDFS Folder level permissions to the temp (/tmp) Folder.

Type Folder Permissions
HDFS /user/hive/warehouse 1777
HDFS /tmp 1777

File System – HDFS – Create Folder ( /user/hive/warehouse)

  • Create HDFS Folder –> /user/hive/warehouse
sudo -u hdfs hadoop fs -mkdir /user/hive/warehouse

Explanation:

  • In the example above, we are assuming the identity of the HDFS owner and invoking the hadoop fs module.  And, we are specifically submitting a command (mkdir <folder>).  The folder we are creating is obviously /user/hive/warehouse.
  • Please keep in mind that this is an HDFS folder and not an OS folder

Output:

Hadoop - HDFS -ls user--hive--warehouse

File System – Set HDFS Permissions ( /user/hive/warehouse)

  • Set HDFS Permissions –> /user/hive/warehouse
sudo -u hdfs hadoop fs -chmod -R 1777 /user/hive/warehouse

File System – Get HDFS Permissions ( /user/hive/warehouse)

  • Get HDFS Permissions –> /user/hive/warehouse
sudo -u hdfs hadoop fs -ls -d /user/hive/warehouse

Output:

Hadoop - HDFS -ls user--hive--warehouse (revised - v2)

File System – HDFS – Folder Permissions ( /tmp)

File System – HDFS – Set FS Permissions ( /tmp)

  • Set OS File Permissions –> /tmp
sudo -u hdfs hadoop fs -chmod -R 1777 /tmp

HDFS – Get FS Permissions ( /tmp)

  • Get HDFS Permissions –> /tmp
sudo -u hdfs hadoop fs -ls -d /tmp

Output:

Hadoop -- HDFS -- tmp

 

Services

Introduction

Here are Hive specific services; along with their Port#.

Service Name & Port#

Here are the pertinent values for us.

Service Service Name Port#
Hive MetaStore hive-metastore 9083
Hive Server (v2) hive-server2 10000

Start Command

Please start the services in the following order:

  • hive-metastore
  • hive-server2
sudo service hive-metastore start
sudo service hive-server2 start

Please refer to and monitor the service-specific log files for any errors and warnings.
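For example, to keep an eye on the metastore as it comes up (a sketch; the log file locations are tabulated in the Log Files section below):

      sudo tail -f /var/log/hive/hive-metastore.log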

Stop Command

Please stop the services in the following order:

  • hive-server2
  • hive-metastore
$ sudo service hive-server2 stop
$ sudo service hive-metastore stop

Log Files

Introduction

Here are the service-specific log files.

Service Log File
Hive MetaStore /var/log/hive/hive-metastore.log
Hive Server (v2) /var/log/hive/hive-server2.log

Validate Install

Introduction

Here are some quick tests that you can do to check connectivity between the components.

Hive Server -to- Meta Store Connectivity

As previously discussed, Hive stores schema data for each object in the metastore.

So one way of testing that the metastore is available and usable by Hive is to ask for all tables.

Launch Hive Client

Syntax:
        hive

Sample:
        hive

Ask for tables

Syntax:
        show tables;

Sample:
        show tables;

Output:

Hive -- Client -- Show Tables

Hive – Client Access Tools

Introduction

Hive is a database, and so to test it out you can use any of the available database query tools that support JDBC.

Why JDBC? Well, because Hive is a Java-based tool.

Available tools

Here are some of the available tools:

Vendor\Repository Tool  URL
Sourceforge.net SQLLine http://sqlline.sourceforge.net/
Apache Hive http://archive.cloudera.com/redhat/cdh/unstable/RPMS/noarch/
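As a quick connectivity sketch (an assumption on my part that the Beeline client shipped alongside HiveServer2 is on the path), one can point a JDBC client at HiveServer2's default port of 10000:

      beeline -u jdbc:hive2://localhost:10000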

Trouble Shooting

Trouble Shooting – Hive MetaStore – Server

Trouble Shooting – Hive MetaStore – Server – Configuration (/etc/hive/conf.dist/hive-site.xml) –>  hive.metastore.uris

As mentioned earlier, there is a lone configuration file, and the name of that file is hive-site.xml.

If the hive-site.xml file is not fully configured, you will get the error message stated below:

Hive - MetaStore - Log - Error - MetaStore URL not configured

The error lines are:

  • org.apache.thrift.transport.TTransportException: Could not create ServerSocket on address 0.0.0.0/0.0.0.0:9083.
  • Exception in thread “main” org.apache.thrift.transport.TTransportException: Could not create ServerSocket on address 0.0.0.0/0.0.0.0:9083.
  • ERROR metastore.HiveMetaStore (HiveMetaStore.java:main(4150)) – Metastore Thrift Server threw an exception…

To remediate:

  • Please add the hive.metastore.uris entry, specifying the IP address (or FQDN) and port of the Thrift metastore
  • In earlier versions of Hive, one could simply indicate locality (local or external metastore) by employing the hive.metastore.local XML item.  That is no longer the case, and hive.metastore.uris is now a required item
  • As we are in pseudo mode, we will simply use localhost and offer the default port# of 9083
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://localhost:9083</value>
  <description>
IP address (or fully-qualified domain name) and port of the 
metastore host
</description>
</property>

A couple of other important items:

  • This error is encountered when starting the Hive MetaStore Service ( sudo service hive-metastore start)
  • As the hive-metastore service is left running, please correct the XML file and restart the service ( sudo service hive-metastore restart)

Trouble Shooting – Hive MetaStore – Server – Configuration (/etc/hive/conf.dist/hive-site.xml) –> hive user’s password

If the Hive’s user password is incorrect, you will get the error pasted below:

Hive - MetaStore - Log - Error - JDOFatalDataStoreException -- Access denied for user

The error lines are:

  • javax.jdo.JDOFatalDataStoreException: Access denied for user ‘hive’@’rachel.labDomain.org’ (using password: YES)
  • java.sql.SQLException: Access denied for user ‘hive’@’rachel.labDomain.org’ (using password: YES)
  • ERROR metastore.HiveMetaStore (HiveMetaStore.java:startMetaStore(4247)) – javax.jdo.JDOFatalDataStoreException: Access denied for user ‘hive’@’labDomain.org’ (using password: YES)
  • ERROR metastore.HiveMetaStore (HiveMetaStore.java:main(4150)) – Metastore Thrift Server threw an exception…javax.jdo.JDOFatalDataStoreException: Access denied for user ‘hive’@’rachel.labDomain.org’ (using password: YES)

To remediate:

  • Please check the hive user and password


<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
  <description>
  </description>
</property>

<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>password</value>
  <description>
  </description>
</property>

Trouble Shooting – Hive Server2 – Server – Configuration (/etc/hive/conf.dist/hive-site.xml) –> unable to connect to hive metastore

Hive -- Server2 - Log (unable to connect to metastore)

  • hive.metastore (HiveMetaStoreClient.java:open(285)) – Failed to connect to the MetaStore Server…
  • service.CompositeService (CompositeService.java:start(74)) – Error starting services HiveServer2org.apache.hive.service.ServiceException: Unable to connect to MetaStore!
  • Caused by: MetaException(message:Could not connect to meta store using any of the URIs provided. Most recent failure: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused

To remediate:

  • Check if Hive MetaStore is running
  • Determine what port it is running under
  • Consider restarting service
  • Consider killing MetaStore process ID

Trouble Shooting – Hive MetaStore – Server – Missing JDBC Connector Jar files

Depending on the DB backend you have chosen – remember it can be Derby, MySQL, PostgreSQL, or Oracle – you might experience connectivity issues if the corresponding JDBC connector jar file is missing.

Hive -- Server2 - Log (unable to connect to metastore)

  • javax.jdo.JDOFatalInternalException: Error creating transactional connection factory at org.datanucleus.jdo.NucleusJDOHelper.getJDOExceptionForNucleusException(NucleusJDOHelper.java:425)
  • Caused by: org.datanucleus.exceptions.NucleusException: Attempt to invoke the “DBCP” plugin to create a ConnectionPool gave an error : The specified datastore driver (“com.mysql.jdbc.Driver”) was not found in the CLASSPATH. Please check your CLASSPATH specification, and the name of the driver.
  • Caused by: org.datanucleus.store.rdbms.datasource.DatastoreDriverNotFoundException: The specified datastore driver (“com.mysql.jdbc.Driver”) was not found in the CLASSPATH. Please check your CLASSPATH specification, and the name of the driver.
  • ERROR metastore.HiveMetaStore (HiveMetaStore.java:main(4150)) – Metastore Thrift Server threw an exception…javax.jdo.JDOFatalInternalException: Error creating transactional connection factoryat org.datanucleus.jdo.NucleusJDOHelper.getJDOExceptionForNucleusException(NucleusJDOHelper.java:425)

To remediate:

  • The database connector jar file is missing
  • In our case, we needed to create the following symbolic link:
    
    sudo ln -s /usr/share/java/mysql-connector-java.jar \
        /usr/lib/hive/lib/mysql-connector-java.jar
    

Trouble Shooting – Hive MetaStore – Server – Access Control Exception – Permission denied

Permission denied.

Screen Output:

Hive - Client - Permission denied

Text Output:



hive> create table xyzlogTable (dateC string);
FAILED: Error in metadata: MetaException(message:Got exception: org.apache.hadoop.security.AccessControlException Permission denied: user=hive, 
access=WRITE, inode="/user":hdfs:supergroup:drwxr-xr-x
 at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(
FSPermissionChecker.java:205)
 at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(
FSPermissionChecker.java:186)
	at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:135)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:4684)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAncestorAccess(FSNamesystem.java:4655)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(FSNamesystem.java:2996)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:2960)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:2938)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:648)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:417)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44096)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1695)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1691)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1689)
)
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
hive> 

To remediate:

  • Review the File System & HDFS permissions; one possible remediation is sketched below
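One possible remediation, sketched here under the assumption that the hive user simply lacks a writable home directory under /user, is to create one and hand it over to hive:

      sudo -u hdfs hadoop fs -mkdir /user/hive
      sudo -u hdfs hadoop fs -chown hive /user/hive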

References

References – MetaStore Admin

References – HDFS – Shell

References – Files – Symbolic Links

References – Files – Unix Bash

Technical: Hadoop/Cloudera [CDH]/MetaStore – MySQL Database

Background

I am laying the groundwork for a Hadoop/Cloudera [CDH]/Hive installation and trying to do my homework.  One of the pre-requisites is a Hive MetaStore.

The Hive MetaStore stores metadata information for Hive Tables.  It provides the base plumbing for the Hive Tables.

Hive is a bit better positioned when some of its metadata is externalized, rather than kept in the default, so-called "Embedded Database" (Derby).

Thus, the choice of MetaStore database, in terms of reliability, is very important.

Choices

The choice of MetaStore currently comes down to a few Database products.  For Cloudera 4.2 the choices are tabulated in an easy to read format @

http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.2.0/CDH4-Requirements-and-Supported-Versions/cdhrsv_topic_2.html#topic_2_unique_4

Database Version
MySQL v5.5
PostgreSQL v8.3
Oracle v11gR2
Derby Default

Our Choice

As we are gunning for a small, self-contained, widely understood lab environment, we narrowed the choice down to MySQL and PostgreSQL.

It appears that since MySQL's purchase by a big database vendor, some people have started shopping around for other databases still principally owned by the open-source community.

And, so PostgreSQL is growing a bit.  A couple of years ago, I set up my own WordPress Instance and found out it bundles a PostgreSQL database.

And so I really would like to go with PostgreSQL.  But, for no particular technical reason, I will stick with MySQL, because Hive really appears to be foundationally closer in terminology and usage to MySQL.

Install MySQL Server

Install MySQL Server



Syntax:
    sudo yum install mysql-server

Sample:
    sudo yum install mysql-server

Output:

mySQL - Installation - Install Log

MySQL – Start Service/daemon

Start mySQL daemon



Syntax:
      sudo service mysqld start

Sample:
      sudo service mysqld start 

Output:

mySQL - Start - Log

Explanation:

  • Mostly everything is good
  • The lone major problem is a DNS issue that has been bugging me for a few days now.
  • It all started with my Microsoft Windows Active Directory going down; everything after that has been downhill
  • MySQL uses hostnames and domain names for user privileges and so I have to be on the look-out for hard-coded host and FQDN names

MySQL Connector

 

Introduction

To connect to MySQL from Hive, one needs client libraries.

As Hive is a Java-based application, we need libraries that are consumable from Java.

Here is a quick overview of MySQL Connector:

Overview of MySQL Connector:

http://dev.mysql.com/doc/refman/5.6/en/connector-j-overview.html

MySQL provides connectivity for client applications developed in the Java programming language through a JDBC driver, which is called MySQL Connector/J.

MySQL Connector/J is a JDBC Type 4 driver. Different versions are available that are compatible with the JDBC 3.0 and JDBC 4.0 specifications. The Type 4 designation means that the driver is a pure Java implementation of the MySQL protocol and does not rely on the MySQL client libraries.

For large-scale programs that use common design patterns of data access, consider using one of the popular persistence frameworks such as Hibernate, Spring’s JDBC templates or Ibatis SQL Maps to reduce the amount of JDBC code for you to debug, tune, secure, and maintain.

RPM Package Name

The name of the RPM is mysql-connector-java

 

Get Package Info

Use “yum info” to get a bit of information about the RPM.


Syntax:
      yum info <package-name>

Sample:
      yum info mysql-connector-java

Output:

Mysql - mysql-connector-java -- yum -- info

List Package Files

As we do not yet have the rpm and we do not really want to download it, let us use “repoquery” to get a listing of files that are bundled in the RPM.


Syntax:

      repoquery -lq <package-name>

Sample:
      repoquery -lq mysql-connector-java

Output:

Mysql - mysql-connector-java -- repoquery

Explanation:

  • The files will mostly be installed in /usr/share/java
  • Jar file -> /usr/share/java/mysql-connector-java-5.1.17.jar
  • Jar file -> /usr/share/java/mysql-connector-java.jar
  • Build Tool (maven)-> /etc/maven/fragments/mysql-connector-java and /usr/share/maven2/poms/JPP-mysql-connector-java.pom
  • And, a few doc files

The files that we actually need are the JAR files.

Install MySQL-Connector-Java

We now have enough information to know which files we will need to copy to the Hive nodes.

As they are jar files, availing them on the Hive nodes and adjusting the classpath should be enough to make them available to the Hive clients.

Please keep in mind that the previous statement is only true for Type 4 JDBC drivers, which this one, MySQL Connector/J, clearly is.

Let us go ahead with the actual install.



Syntax:
    sudo yum install <package-name>

Sample:
    sudo yum install mysql-connector-java

Output:

Mysql - mysql-connector-java - install - log

 

Avail MySQL Connector for Hive Usage

Introduction

Installing mysql-connector-java avails the client modules that clients such as Hive need in order to connect to MySQL.

These files need to be placed in the Hive library folder.

Review Client Files

The mysql-connector-java installer places the files in the /usr/share/java folder:



Syntax:
    ls -la <folder> | grep "mysql-connector-java"

Sample:

    ls -la /usr/share/java | grep "mysql-connector-java"

 

Output:

Mysql - mysql-connector-java - client

Output Interpretation

File Use
mysql-connector-java-5.1.17.jar Actual File (Version specific)
mysql-connector-java.jar Symbolic link (currently pointing to mysql-connector-java-5.1.17.jar)

Destination

On each Hadoop Hive node, the identified files will need to be copied (or symbolically linked) to the /usr/lib/hive/lib folder, as sketched below.
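A minimal sketch of that step, reusing the symbolic-link approach from the Hive install write-up (it assumes the connector RPM is already installed on the node):

      sudo ln -s /usr/share/java/mysql-connector-java.jar /usr/lib/hive/lib/mysql-connector-java.jar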

MySQL – Secure Database

Introduction

During Installation, MySQL’s root password is left un-set.

Please!, please! set it to something secured once you have finished installing MySQL Server and Client.

Hardening MySQL – Scope

The process of better protecting an Application is sometimes referred to as “Hardening” that Application.

MySQL is bundled with a fairly simple script that offers basic hardening. The name of the script is mysql_secure_installation.

The areas covered are listed under:

mysql_secure_installation — Improve MySQL Installation Security

http://dev.mysql.com/doc/refman/5.0/en/mysql-secure-installation.html

a) You can set a password for root accounts.

b) You can remove root accounts that are accessible from outside the local host.

c) You can remove anonymous-user accounts.

d) You can remove the test database (which by default can be accessed by all users, even anonymous users), and privileges that permit anyone to access databases with names that start with test_.

Hardening MySQL – Actual



Syntax:
    sudo /usr/bin/mysql_secure_installation

Sample:
    sudo /usr/bin/mysql_secure_installation

Output:

Mysql - secure installation

MySQL – Client Install – What is the RPM Name?

Introduction

If we will be querying MySQL from a different host, I know I need to install the client RPM on that host.

Find MySQL Client RPMs

Let us use yum list to query for packages whose names start with mysql…


Syntax:
    yum list mysql\*

Sample:
    yum list mysql\*

Output:

Mysql - yum list mysql

The RPMs that come up are tabulated below.  For each RPM, we ran "yum info <package-name>" to get the package's summary information.

 

Package Use Version#
mysql.i686 MySQL client programs and shared libraries 5.1.69-1.el6_4
mysql-connector-java.noarch Official JDBC driver for MySQL 1:5.1.17-6.el6
mysql-libs.i686 The shared libraries required for MySQL clients 5.1.69-1.el6_4
mysql-server.i686 The MySQL server and related files 5.1.69-1.el6_4
MySQL-python.i686 Python interface to MySQL 1.2.3-0.3.c1.1.el6
mysql-connector-odbc.i686 ODBC driver for MySQL 5.1.69-1.el6_4
mysql-devel.i686 Files for development of MySQL applications 5.1.69-1.el6_4
mysql-embedded.i686 MySQL as an embeddable library 5.1.69-1.el6_4
mysql-embedded-devel.i686 Development files for MySQL as an embeddable library 5.1.69-1.el6_4
mysql-test.i686 The test suite distributed with MySQL 5.1.69-1.el6_4

Choice

The most likely RPM appears to be the plain mysql.i686 package; an install sketch follows.
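So, on a host that only needs to query MySQL remotely, the install would look like this (a sketch):

      sudo yum install mysql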

MySQL – Secure Database (Review)

Introduction

Let us quickly review our MySQL Instance and make sure that the changes look OK, and that MySQL is actually installed and available via the root password that we used.

Connect to mysql

Connect to mysql



Syntax:
    /usr/bin/mysql --host <hostname> --user root --password

Sample:
    /usr/bin/mysql --host localhost --user root --password

Output:

Mysql - Client - Initiate

Issue Test

Issue Tests

Issue Test – Does root account have a password

Check the mysql.user table and review the user, host, and password entries:



Syntax:
         select user, host, password from mysql.user 

Sample:
         select user, host, password from mysql.user

Output:

Mysql -- Query -- mysql.user

Explanation:

  • As the password column actually has content, it is not blank/empty
  • There are two entries: one for localhost and the other for 127.0.0.1; 127.0.0.1 is very useful for cases where your DNS name resolution is not working
  •  More information about 127.0.0.1 is available @

    http://dev.mysql.com/doc/refman/5.0/en/default-privileges.html

    On Unix, each root account permits connections from the local host. Connections can be made by specifying the host name localhost, the IP address 127.0.0.1, or the actual host name or IP address.

Issue Test – Disallow root account from outside the Database Server

Check the mysql.user table and review the user, host, and password entries:



Syntax:
         select user, host, password from mysql.user 

Sample:
         select user, host, password from mysql.user

Output:

Mysql -- Query -- mysql.user

Explanation:

  • Ensure that root records that have hosts bearing localhost and 127.0.0.1 are the only ones present

Issue Test – Anonymous user

Check the mysql.user table and review the user, host, and password entries:



Syntax:
         select user, host, password from mysql.user 

Sample:
         select user, host, password from mysql.user

Output:

Mysql -- Query -- mysql.user

Explanation:

  • Ensure that the anonymous user is not present
  • An empty user column in the mysql.user table signifies that anonymous users have access
  • Here is a quick sample for granting permissions to the anonymous user

    How to mySQL create Anonymous Account
    http://www.cyberciti.biz/tips/howto-mysql-create-anonymous-account.html

    mysql> use <dbname>;

    mysql> GRANT SELECT ON xyz TO ''@localhost;
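Conversely, if an anonymous row does show up in mysql.user, a common way to remove it (a sketch; adjust the host value to match the offending row) is:

    mysql> DROP USER ''@'localhost';
    mysql> FLUSH PRIVILEGES;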

Issue Test – Is the Test database present

Issue “show databases”



Syntax:
         show databases;

Sample:
         show databases;

Output:

Mysql -- Metadata -- show databases

Explanation:

  • The test database is still present
  • We will ignore it for now

MySQL – Configuration – Availability

Introduction

We are making MySQL a big part of our Hadoop\Hive install.

Does it AutoStart?

We want it on, all the time!


Syntax:
         sudo /sbin/chkconfig --list <service-name>

Sample:
         sudo /sbin/chkconfig --list mysqld

Output:

chkconfig -- list -- mysqld

Explanation:

  • mysqld is not configured to auto-start

Let us go correct that!

Set for AutoStart

Set for AutoStart..


Syntax:

         --to configure a service to auto-start at runlevels 2,3,4,5
         sudo /sbin/chkconfig <service-name> on

         --to configure a service to auto-start at specific run-levels
         sudo /sbin/chkconfig --level <runlevels> <service-name> on

Sample:
         sudo /sbin/chkconfig mysqld on

         sudo /sbin/chkconfig --level 345 mysqld on

Output:

chkconfig -- set -- levels (v2)

Explanation:

  • mysqld is being configured to auto-start at the following levels (2,3,4,5)

MySQL – Configuration – Resource Utilization

Introduction

Depending on what MySQL is being used for and how heavily it is being leaned on, it is a good idea to set maximum resource usage limits.

Memory

It does not appear that MySQL in its current iteration has a single setting for overall memory utilization.

Other database systems, e.g. Microsoft SQL Server, have settings such as 'max server memory' that get close to capping the maximum amount of memory an instance will ever request.

Unfortunately, capping memory in MySQL involves more component-level settings.

Please Google to broaden your familiarity with this issue.  Here is some of what I came up with:

MySQL – Configuration – ErrorLog ( Looking for Answers)

Introduction

It goes without saying that, unlike a client-based desktop application where errors blow up in your face via message boxes or specially highlighted / color-coded error messages, errors in server (daemon) applications are quietly logged in error logs.

It is up to you, the sudden MySQL proprietor, to determine where the logs are placed and to monitor the log metadata (sizes) and contents.

Where are the Log Files?

Where are the log files? I do not know either.

So let us google for “Mysql Error Log Location”.

Google nailed it as I quickly found “Ronald Bradford”. Ronald calls himself the MySQL Expert.  The specific blog posting I found is:

Monitoring MySQL The Error Log
http://ronaldbradford.com/blog/monitoring-mysql-the-error-log-2009-09-16/

Based on his presentation alone, you can tell this Kid is serious about his MySQL.

As the good word says:

And how will anyone go and tell them without being sent? That is why the Scriptures say, “How beautiful are the feet of messengers who bring good news!”

From Ronald’s blog, I know a lot more about MySQL than I would have ever known:

a) If not specified, the default is [datadir]/[hostname].err
b) He advises that the log location should be changed via editing the my.cnf file

MySQL – Configuration – ErrorLog ( Researching Answers )

Introduction

Again, Ronald says to look under the DataDir

Research

Where is the DataDir?  First, where is MySQL running from?

As I know I have a running MySQL process, I will try using ps.  On the other hand, if MySQL were not running, I would probably have issued "find / -name datadir 2> /dev/null".


Syntax:
         ps -aux | grep -i mysql

Sample:
         ps -aux | grep -i mysql

Output:

Mysql -- ps

Explanation:

  •  The datadir is /var/lib/mysql

Check /var/lib/mysql for *.err file



Syntax:
         ls -la /var/lib/mysql

Sample:
         ls -la /var/lib/mysql

Output:

Mysql -- Folder Listing -- var-lib-mysql

Explanation:

    • The error file is not there
    • But, like all good directions, I know Ronald Bradford’s instructions are good and complete
    • If we go back and look at the result of "ps -aux | grep mysql":
      
      Command: 
      ps -aux | grep -i mysql | grep -v "grep" 2>/dev/null
      
      Output: 
      
      root 1676 0.0 0.0 5120 1408 ? S 08:23 0:00 /bin/sh /usr/bin/mysqld_safe --datadir=/var/lib/mysql --socket=/var/lib/mysql/mysql.sock --pid-file=/var/run/mysqld/mysqld.pid --basedir=/usr --user=mysql
      
      mysql 1778 0.0 0.5 135476 15232 ? Sl 08:23 0:00 /usr/libexec/mysqld --basedir=/usr --datadir=/var/lib/mysql --user=mysql --log-error=/var/log/mysqld.log --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/lib/mysql/mysql.sock
      
      [dadeniji@rachel ~]$
      
      
    • Thus, it appears from the command line that --log-error is pointing at /var/log/mysqld.log
    • When I issued "cat /var/log/mysqld.log", I received an error message stating that permission is denied; which makes sense, as mysqld runs under the mysql user.  This also follows what Ronald wrote: he suggested changing the permission set on this file so that it is viewable by the other users responsible for monitoring the application, and by administrative users.  A sketch of both options follows after this list.
    • Mysql - Log File - var-log-mysql (permission denied) Validated instruction set of /var/log/mysqld.log
    • Mysql - Log File - var-log-mysql
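Here is that sketch (the chmod value is my own choice, not Ronald's):

      # read the error log with elevated privileges
      sudo tail -n 100 /var/log/mysqld.log

      # or loosen the file's permissions so monitoring/administrative users can read it
      sudo chmod 0644 /var/log/mysqld.log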

Application Specific Tooling

So once the base MySQL Engine is installed, the next thing to do is “tailor” it for various applications.

That tailoring will include creating application-specific principals (users) and objects, and granting those users access to the declared objects.

For instance, Hive has its own user and objects that we will have to create as part of its installation.

Since we do not have Hive installed yet, we will wait until we install it, and thus have easy and ready access to the SQL scripts that contain the object creation statements.

Application Specific Tooling – Hive

As a place holder, here are the steps for HIVE.

For Hive, the steps above are made concrete as:

  • Create database known as metastore
  • The schema script’s name is formatted as  /usr/lib/hive/scripts/metastore/upgrade/mysql/hive-schema-x.y.z.mysql.sql; ie /usr/lib/hive/scripts/metastore/upgrade/mysql/hive-schema-0.10.0.mysql.sql
  • Create Hive specific database user
  • Grant permissions to the Hive user
  • Configure Application specific configuration file.  For Hive the file’s name is hive-site.xml 

References

References – RPM

References – MySQL – mysql_secure_installation

References – Services – Configuration (chkconfig)

References – MySQL – Error Log

References – MySQL – Permissions

Technical: Hadoop/Cloudera (v4.2.1) – Installation on CentOS (32 bit [x32]) – HBase

Introduction

For the Cloudera Distribution of Hadoop, the install is bundled as an RPM, and thus it is a straightforward install.

Once installed, there are some post install configuration steps. Let us see how things work out.

Installation – HBASE (Base)

Introduction

Let us review the install binary.  The package’s is eponymously named HBASE.

Review Package Binary

Let us issue “yum info” to review the package.

Syntax:
   
   yum info <package-name>

Actual:
   yum info hbase

yum --info hbase

Explanation:

  • A couple of things stand out – the architecture is i686.
  • The version is 0.94
  • And, it is part of CDH4

So just about everything is good and matches our current base install.  But I am not quite sure about i686.

Dependency Info

Dependency info



Syntax:

   yum deplist --nogpgcheck  <package-name>

Sample:

   yum deplist --nogpgcheck  hbase

Output:

yum -- hbase -- dependency check

Installing HBASE (Base)

Introduction

Here are the actual RPM install steps.



sudo yum install hbase

Output:

yum -- hbase -- install rpm

I feel like Doug Flutie in '84; I threw up a Hail Mary, thinking it would not install on a 32-bit system, but it did.

Review RPM Installed Files

Introduction

Use “rpm -ql <package>”, to review installed files.

Syntax:

   rpm -ql <package-name>

Sample:
   rpm -ql hbase

Review – shell (.sh)


Syntax:

   rpm -ql <package-name> | grep -i "sh"

Sample:
  rpm -ql hbase | grep -i "sh"

Here are the Shell (Unix Bash shell and ruby) files.

Output:

yum -- hbase -- ql -- shell

Review – Configuration files (.xml)

Configuration data are usually tucked away in XML files.


Syntax:

   rpm -ql <package-name> | grep -i "xml"

Sample:
  rpm -ql hbase | grep -i "xml"

Here are the XML files.

Output:

yum -- hbase -- ql -- xml

 

Review – Java Jar files


Syntax:

   rpm -ql <package-name> | grep -i "jar"

Sample:
  rpm -ql hbase | grep -i "jar"

Here are the Java Jar files.

Output:

 Jar File Purpose
 /usr/lib/hbase/hbase.jar Hbase
 /usr/lib/hbase/lib/avro-1.7.3.jar Data Serialization System
 /usr/lib/hbase/lib/httpclient-4.1.3.jar Http Client
 /usr/lib/hbase/lib/jetty-6.1.26.cloudera.2.jar Jetty is a pure Java-based HTTP server and Java Servlet container
 /usr/lib/hbase/lib/libthrift-0.9.0.jar The Apache Thrift software framework, for scalable cross-language services development, combines a software stack with a code generation engine to build services that work efficiently and seamlessly
 /usr/lib/hbase/lib/protobuf-java-2.4.0a.jar Secure Data Connector protocol reference implementation
 /usr/lib/hbase/lib/slf4j-api-1.6.1.jar The Simple Logging Facade for Java or (SLF4J) serves as a simple facade or abstraction for various logging frameworks
/usr/lib/hbase/lib/snappy-java-1.0.4.1.jar Compression – Snappy
/usr/lib/hbase/lib/zookeeper.jar ZooKeeper

Installation – HBASE (Master)

Introduction

This file installs the HBASE Master service.

Let us review the install binary.  The package’s name is hbase-master.

Review Package Binary

Let us issue “yum info” to review the package.

Syntax:

   yum info <package-name>

Actual:

   yum info hbase-master

Output:

Hadoop - HBase - Master - yuminfo

Explanation:

  • The architecture is also i686.
  • The version is 0.94
  • And, it is part of the CDH4

Installing HBASE (Master)

Introduction

Here are the actual RPM install steps.


Syntax:
   sudo yum install hbase-master

Sample:
   sudo yum install hbase-master

Output:

 Hadoop - HBase - Master - rpm install

Review RPM Installed Files (HBase-Master)

Introduction

Use “rpm -ql <package>”, to review installed files.

Syntax:

   rpm -ql <package-name>

Sample:
   rpm -ql hbase-master

Output:

Hadoop - Hbase - Master - File List

Explanation:

  • The lone file installed is /etc/rc.d/init.d/hbase-master
  • As this file is in the /etc/rc.d/init.d folder it is a service initialization script
  • The file is a Bash shell script and it is easy enough to read and follow

Installing – Zookeeper

Introduction

This file installs the ZooKeeper service.

Let us review the install binary.  The package’s name is zookeeper-server.

Review Package Binary

Let us issue “yum info” to review the package.

Syntax:

   yum info <package-name>

Actual:

   yum info zookeeper-server

Output:

Hadoop - Zookeeper - Master -- yuminfo

Explanation:

  • The architecture is noarch.
  • The version is 3.4.5+16
  • And, it is part of the CDH4

Install Zookeeper

Introduction

Here are the actual RPM install steps.


sudo yum install zookeeper-server

Output:

Hadoop - Zookeeper - Install - Log

Review Zookeeper

Introduction

Use “rpm -ql <package>”, to review installed files.

Syntax:

   rpm -ql <package-name>

Sample:
   rpm -ql zookeeper-server

Output:

Hadoop - Zookeeper - rpm - review

Explanation:

  • The lone file installed is /etc/rc.d/init.d/zookeeper-server
  • As this file is in the /etc/rc.d/init.d folder it is a service initialization script

Installing – HBase – Region Server

Introduction

This file installs HBase Region Server.

Let us review the install binary.  The package’s name is hbase-regionserver.

Review Package Binary

Let us issue “yum info” to review the package.

Syntax:

   yum info <package-name>

Actual:

   yum info hbase-regionserver

Output:

Hbase - RegionServer -- yumInfo

Explanation:

  • The architecture is i686.
  • The version is 0.94
  • And, it is part of CDH4

Installing – HBase – Region Server

Introduction

Here are the actual RPM install steps.


sudo yum install hbase-regionserver

Output:

Hadoop - Hbase - RegionServer - Install - Log (v2)

Review Hbase Region-Server

Introduction

Use “rpm -ql <package>”, to review installed files.

Syntax:

   rpm -ql <package-name>

Sample:
   rpm -ql hbase-regionserver

Output:

Hadoop - Hbase - RegionServer - rpm - review

Explanation:

  • The lone file installed is /etc/rc.d/init.d/hbase-regionserver
  • As this file is in the /etc/rc.d/init.d folder it is a service initialization script

CDH Services (HBase)

Prepare Inventory of CDH Services

Service List

Here is our expected Service List.

Component Service Name
HBase Master hbase-master
HBase ZooKeeper zookeeper-server
HBase Region Server hbase-regionserver

Using chkconfig, list the Hadoop HBase & ZooKeeper services


Syntax:

   # list all services
   sudo chkconfig --list 

   # list specific services, based on name
   sudo chkconfig --list | grep -i <service-name>

Sample:

   sudo chkconfig --list | egrep -i "hbase|zoo"

Output:

Hadoop - HBase - Services - List

CDH Services And Network Ports (Prior HBase)

Here is the list of ports being used by the Hadoop core services, prior to starting HBase.

Using netstat:



Syntax:

   # list all listening TCP ports, with the owning process
   sudo netstat -ltp

Sample:

   sudo netstat -ltp

Output:

netstat -ltp
Using Hadoop Default Ports ( http://blog.cloudera.com/blog/2009/08/hadoop-default-ports-quick-reference/ ), we are able to map well known network ports to Application:

Port Application
50010 Hadoop – HDFS – Data Node
50020 Hadoop – HDFS – Data Node
50030 Hadoop – Map Reduce – Job Tracker
50060 Hadoop – Map Reduce – Task Tracker
50070 Hadoop – HDFS – Name Node
50075 Hadoop – HDFS – Data Node
50090 Hadoop – HDFS – Secondary Name Node
58160 Hadoop – MapReduce

 

Post Installation – Configuration

Introduction

Let us review the configuration files and determine whether there are some things we need to do.

Review HBASE Configuration files

Here are the HBASE Configuration files:

  • /etc/hbase/conf.dist/hbase-policy.xml
  • /etc/hbase/conf.dist/hbase-site.xml

hbase-policy.xml

HBASE Configuration files – hbase-policy.xml

As the name indicates, the file contains Policy data.  By policy, we mean security policy.

Policy Use
security.client.protocol.acl ACL for clients talking to HBase Region Server
security.admin.protocol.acl ACL for HMaster Interface protocol implementation – clients talking to Hmaster for admin operations
security.masterregion.protocol.acl Region Servers communicating with HMaster Server

hbase-site.xml

Introduction

So the full file name for hbase-site.xml is /etc/hbase/conf.dist/hbase-site.xml

Let us review the current contents

Delivered Configuration

Item Value
hbase.cluster.distributed true
hbase.rootdir hdfs://myhost:port/hbase

Entry – hbase.cluster.distributed

  • Set hbase.cluster.distributed to true

Entry – hbase.rootdir

  • Be sure to replace myhost with the hostname of your HDFS NameNode (as specified by fs.default.name or fs.defaultFS in your conf/core-site.xml file); you may also need to change the port number from the default (8020).
  • In CDH4, there are two core-site.xml files — /etc/hadoop/conf.empty/core-site.xml and /etc/hadoop/conf.pseudo.mr1/core-site.xml
  • The /etc/hadoop/conf.empty/core-site.xml is empty
  • On the other hand, the /etc/hadoop/conf.pseudo.mr1/core-site.xml has data in it:

    <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:8020</value>
    </property>

Let us use hdfs://localhost:8020/hbase as hbase.rootdir.
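For the record, here is a minimal sketch of how one might apply and then verify that change from the shell (this assumes the active HBase configuration really is the one under /etc/hbase/conf.dist):

   # edit the file and set hbase.rootdir and hbase.cluster.distributed
   sudo vi /etc/hbase/conf.dist/hbase-site.xml

   # confirm the value landed; -A 1 also prints the <value> line that follows the <name>
   grep -A 1 "hbase.rootdir" /etc/hbase/conf.dist/hbase-site.xml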

Configuration – Post Changes

Here are the post changes.

Item Value
hbase.cluster.distributed true
hbase.rootdir hdfs://localhost:8020/hbase

Configuration – Post Changes

Hadoop - Hbase - hbase-site (20130521 0246PM)

Post Installation – Create HDFS File System

Introduction

Everything with data in Hadoop is backed by HDFS.

Let us go create and set File System permissions on our HDFS Name Node.

 

HDFS – HBase Folder – Create Folder

Create HDFS folder /hbase


Syntax:

    sudo -u hdfs hadoop fs -mkdir /hbase

Sample:

   sudo -u hdfs hadoop fs -mkdir /hbase

HDFS – HBase Folder – Permission – Review (Post Folder Creation)

Review /hbase permissions:


Syntax:

    sudo -u hdfs hadoop fs -ls -d <folder>

Sample:

    sudo -u hdfs hadoop fs -ls -d /hbase

Output:

Hadoop - Hbase - Folder Permissions (initial)

Explanation:

  • Issuing “hadoop fs -ls -d” against /hbase returns what we thought it would.  The folder’s owner is hdfs; and the group is supergroup.
  • The folder’s owner has full permissions (rwx)
  • The owner’s group has read and execute permissions (rx)
  • Others also have read and execute permission (rx)

HDFS – HBase Folder – Permission – Change Ownership

As we created our new folder as the hdfs user, we need to go in and change its owner to hbase.

This ownership change gives the HBASE binaries full control over this folder (/hbase); which is no problem, as the HBASE folder is fully dedicated to HBASE.


Syntax:

    sudo -u hdfs hadoop fs -chown <owner> <folder>

Sample:

    sudo -u hdfs hadoop fs -chown hbase /hbase

HDFS – HBase Folder – Permission – Review

Review /hbase permissions:


Syntax:

    sudo -u hdfs hadoop fs -ls <folder>

Sample:

   sudo -u hdfs hadoop fs -ls /hbase

Output:

Hadoop - Hbase - Folder Permissions (change folder owner)

Post Installation – Zookeeper – Init Data Directory

Init ZooKeeper Data Directory


sudo service zookeeper-server init

If Zookeeper Data directory is already initialized, and you try to re-init it, you will get an error message.



Zookeeper data directory already exists at /var/lib/zookeeper ( or use 
--force re-initialization)
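If you really do want to start over, the error message itself hints at the escape hatch; a hedged sketch (note that this wipes whatever is currently under /var/lib/zookeeper):

   sudo service zookeeper-server init --force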

Check Services and Applications

Check

Let us quickly check which Hadoop processes we have running.


Syntax:
    sudo jps

Sample:

   sudo /usr/java/jdk1.7.0_21/bin/jps

Output:

jps - before hadoop - hbase

 

Initiate Hadoop/HBase Services – ZooKeeper-Server

Start ZooKeeper

Let us start ZooKeeper.


Syntax:
   sudo service <service-name> start

Sample:

   sudo service zookeeper-server start

Output (Text):



JMX enabled by default
Using config: /etc/zookeeper/conf/zoo.cfg
ZooKeeper data directory is missing at /var/lib/zookeeper fix the path or run initialize

Output (Screen Dump):

Hadoop - ZooKeeper Data DIrectory is missing

Review ZooKeeper Configuration File

Nice.

Zookeeper configuration file is /etc/zookeeper/conf/zoo.cfg

Here is our current configuration:

Item Value
dataDir /var/lib/zookeeper
clientPort 2181

Review ZooKeeper Data Directory


ls -la /var/lib/zookeeper

Output:

Hadoop - Zookeeper - dataDir (initial)

Init ZooKeeper Data Directory


sudo service zookeeper-server init

Output:


No myid provided, be sure to specify it in /var/lib/zookeeper/myid
if using non-standalone
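Since we are running a single standalone node, the myid warning can safely be ignored.  For a multi-node ensemble one would assign an id at init time; a sketch, with the --myid flag assumed from the packaging's init script:

   sudo service zookeeper-server init --myid=1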

If Zookeeper Data directory is already initialized, and you try to re-init it, you will get an error message.



Zookeeper data directory already exists at /var/lib/zookeeper ( or use 
--force re-initialization)

Review ZooKeeper Data Directory


ls -la /var/lib/zookeeper

Output:

Hadoop - Zookeeper - dataDir (initial)

(Re) Start ZooKeeper

Re-start ZooKeeper.


Syntax:
   sudo service <service-name> start

Sample:

   sudo service zookeeper-server start

Output:



[dadeniji@rachel ~]$ sudo service zookeeper-server start
[sudo] password for dadeniji:
JMX enabled by default
Using config: /etc/zookeeper/conf/zoo.cfg
Starting zookeeper ... STARTED
[dadeniji@rachel ~]$



ZooKeeper – Log – Review

Review ZooKeeper Log files:

  • /var/log/zookeeper/zookeeper.log
  • /var/log/zookeeper/zookeeper.out

ZooKeeper – Log – Review – zookeeper.log

  • Either no config or no quorum defined in config, running  in standalone mode
  • Reading configuration from: /etc/zookeeper/conf/zoo.cfg
  • Server environment:zookeeper.version=3.4.5-cdh4.2.1–1, built on 04/22/2013 16:45 GMT
  • Server environment:java.version=1.7.0_21
  • Server environment:java.vendor=Oracle Corporation
  • Server environment:java.home=/usr/java/jdk1.7.0_21/jre
  • Server environment:java.library.path=/usr/java/packages/lib/i386:/lib:/usr/lib
  • Server environment:user.name=zookeeper
  • Server environment:user.home=/var/run/zookeeper
  • binding to port 0.0.0.0/0.0.0.0:2181

ZooKeeper – Log – Review – zookeeper.out

  • file is empty

ZooKeeper – jps

Using jps, we are able to validate that the ZooKeeper app is running.  The app name is QuorumPeerMain

Initiate Hadoop/HBase Services – HBase Master

Start HBase Master

Let us start hbase-master.


Syntax:
   sudo service <service-name> start

Sample:

   sudo service hbase-master start

Output (Text):



[dadeniji@rachel noip-2.1.9-1]$ sudo service hbase-master start
[sudo] password for dadeniji:
starting master, logging to /var/log/hbase/hbase-hbase-master-rachel.out
[dadeniji@rachel noip-2.1.9-1]$

Output (Screen Dump):

hadoop -- service -- hbase-master -- start

HBase – Master – Log – Review

Review HBase Master log files:

  • /var/log/hbase/hbase-hbase-master-<hostname>.log
  • /var/log/hbase/hbase-hbase-master-<hostname>.out
  • /var/log/hbase/securityAuth.audit

HBase – Log – Review – hbase-hbase-master-<hostname>.log

  • DEBUG org.apache.hadoop.hbase.util.FSUtils: hdfs://localhost:8020/hbase/.archive doesn’t exist

HBase – Log – Review – hbase-hbase-master-<hostname>.out

  • file is empty

Hbase – Master – jps

Using jps, we are able to validate that the HBase Master app is running.  The app name is HMaster

Initiate Hadoop/HBase Services – HBase Region Server

Start HBase Region Server

Let us start hbase-regionserver.


Syntax:
   sudo service <service-name> start

Sample:

   sudo service hbase-regionserver start

Output (Text):



[dadeniji@rachel noip-2.1.9-1]$ sudo service hbase-regionserver start
[sudo] password for dadeniji:
starting regionserver, logging to /var/log/hbase/hbase-hbase-regionserver-rachel.out
[dadeniji@rachel noip-2.1.9-1]$

Output (Screen Dump):

hadoop -- service -- hbase-regionserver - start

HBase – Region Server – Log – Review

Review HBase Region Server log files:

  • /var/log/hbase/hbase-hbase-regionserver-<hostname>.log
  • /var/log/hbase/hbase-hbase-regionserver-<hostname>.out

HBase – Log – Review – hbase-hbase-regionserver-<hostname>.log

  • Extensive logging

HBase – Log – Review – hbase-hbase-regionserver-<hostname>.out

  • file is empty
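At this point, a single jps sweep should show all three daemons side by side; a quick sketch, reusing the full jps path from earlier (HRegionServer is the expected process name for the region server):

   sudo /usr/java/jdk1.7.0_21/bin/jps | egrep "QuorumPeerMain|HMaster|HRegionServer"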

Hadoop/HBase – Shell

Let us play around with HBase and make sure that it is running well.

Start HBase Shell

Start the HBase shell.


Syntax:
         hbase shell

Sample:

         hbase shell

Output (Text):


[dadeniji@rachel ~]$ hbase shell
13/05/22 13:17:58 WARN conf.Configuration: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.94.2-cdh4.2.1, rUnknown, Mon Apr 22 10:56:52 PDT 2013

HBase  – Issue Basic Commands – Status

Status …



Syntax:
         status

Sample:
         status

Output (Text):

    1 servers, 0 dead, 2.000 average load

HBase  – Issue Basic Commands – Version

Version …



Syntax:
         version

Sample:
         version

Output (Text):

    0.94.2-cdh4.2.1, rUnknown, Mon Apr 22 10:56:52 PDT 2013

HBase  – Issue Basic Commands – Whoami

Who is connecting…



Syntax:
         whoami

Sample:
         whoami

Output (Text):

   dadeniji (auth:SIMPLE)

Explanation:

  • whoami — returned our username; and so we are comfortable with the fact that our actual username is being passed along to HBASE.
  • The authenticating mode is SIMPLE

 

HBase  – Issue Basic Commands – Metadata – List Tables

List Tables..



Syntax:
         list

Sample:
         list

Output (Text):

TABLE
0 row(s) in 0.1080 seconds

=> []

HBase  – Issue Basic Commands – DDL – Create Table (Sample : Customer)

Let us create a sample table:

  • Table Name :- customer
  • Compression :- SNAPPY


Syntax:

  create '<table-name>', {NAME => 'cf1', COMPRESSION => '<compression>'}

Sample:

  create 'customer', {NAME => 'cf1', COMPRESSION => 'SNAPPY'}

Output:

hadoop - hbase - customer -- compression - SNAPPY
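Before moving on to the metadata commands, a quick data smoke test can be run from the same shell; the row key and column value below are made up purely for illustration:

   put 'customer', 'row1', 'cf1:name', 'Jane Doe'
   get 'customer', 'row1'
   scan 'customer'

   # clean up the illustrative row
   delete 'customer', 'row1', 'cf1:name'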

HBase  – Issue Basic Commands – DDL – Describe Table (Sample : Customer)

Let us review the table’s definition:

  • Table Name :- customer


Syntax:

  describe '<table-name>'

Sample:

  describe 'customer'

Output:

hadoop - hbase - describe -- Sample -- customer

HBase  – Issue Basic Commands – DDL – Disable Table (Sample : Customer)

Let us disable the table:

  • Table Name :- customer


Syntax:

  disable '<table-name>'

Sample:

  disable 'customer'

Output:

Hadoop - hbase - Table - Disable (Sample - Customer)

HBase  – Issue Basic Commands – DDL – Is Table Disabled (Sample : Customer)

Review table and make sure that it is indeed disabled.

  • Table Name :- customer


Syntax:

  is_disabled('<table-name>')

Sample:

  is_disabled('customer')

Output:

Hadoop - hbase - Table - is_disabled (Sample - Customer)

HBase  – Issue Basic Commands – DDL – Drop Table (Sample : Customer)

As we have confirmed that the table is indeed disabled, let us go ahead and drop it.

  • Table Name :- customer


Syntax:

  drop '<table-name>'

Sample:

  drop 'customer'

Output:

Hadoop - hbase - Table - Drop (Sample - Customer)

Stop HBase Shell

Stop the HBase shell.


Syntax:
         exit

Sample:

         exit

 

CDH Services And Network Ports (HBase)

Here is the breakdown of listening ports now that HBase is running.

Service List / Ports

Using netstat to list listening ports


Syntax:

   # list all listening TCP ports, with the owning process
   sudo netstat -ltp

Sample:

   sudo netstat -ltp

Output:

netstat -- listening port (hadoop - hbase - running)

Port Application
50010 Hadoop – HDFS – Data Node
50020 Hadoop – HDFS – Data Node
50030 Hadoop – Map Reduce – Job Tracker
50060 Hadoop – Map Reduce – Task Tracker
50070 Hadoop – HDFS – Name Node
50075 Hadoop – HDFS – Data Node
50090 Hadoop – HDFS – Secondary Name Node
58160 Hadoop – MapReduce
42067 Zookeeper
60000 Hadoop – HBase – Master
60010 Hadoop – HBase – Master
60020 Hadoop – HBase – RegionServer
60030 Hadoop – HBase – RegionServer
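To zero in on just the ZooKeeper and HBase listeners, the same netstat output can be filtered; a sketch using the client and web ports from the tables above:

   sudo netstat -ltpn | egrep ":2181|:60000|:60010|:60020|:60030"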

Diagnostic

 

Diagnostic – Log File

Log information is journaled in /var/log/hbase.

 

 

 

References

References – HBase

References – Zookeeper

References – Hadoop – Network Ports

References – Hadoop – Setup

References – Hadoop – HBase — hbase-policy.xml

Hadoop/Cloudera (v4.2.1) – Installation on CentOS (32 bit [x32])

Introduction

Here are quick preparation, processing, and validation steps for installing Cloudera – Hadoop (v4.2.1) on 32-bit CentOS.

Blueprint

I am using Cloudera’s fine documentation “http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.2.0/CDH4-Quick-Start/cdh4qs_topic_3_2.html” as a basis.

It is very good documentation, but I stumbled a lot through lack of education and glossing over important details.  And, so I chose to write things down.

Environment Constraints

Here is the constraint that I have to work with:

  • My lab PC is an old Dell
  • It is a 32-bit processor
  • And, so I can only install a 32-bit Linux & Cloudera Distro

Concepts

Here are a couple of concepts that we will utilize:

  • File System – Linux – Stickiness

Concepts : File System – Linux – Stickiness

Background

Sticky Bit

http://en.wikipedia.org/wiki/Sticky_bit

The most common use of the sticky bit today is on directories. When the sticky bit is set, only the item’s owner, the directory’s owner, or the superuser can rename or delete files. Without the sticky bit set, any user with write and execute permissions for the directory can rename or delete contained files, regardless of owner. Typically this is set on the /tmp directory to prevent ordinary users from deleting or moving other users’ files.

In Unix symbolic file system permission notation, the sticky bit is represented by the letter t in the final character-place.

Set Stickiness

http://en.wikipedia.org/wiki/Sticky_bit

The sticky bit can be set using the chmod command and can be set using its octal mode 1000 or by its symbol t (s is already used by the setuid bit). For example, to add the bit on the directory /tmp, one would type chmod +t /tmp. Or, to make sure that directory has standard tmp permissions, one could also type chmod 1777 /tmp.

To clear it, use chmod -t /tmp or chmod 0777 /tmp (using numeric mode will also change directory tmp to standard permissions).
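Putting that together, a short sketch of setting and clearing the bit on /tmp:

   chmod +t /tmp       # set the sticky bit symbolically
   chmod 1777 /tmp     # or set standard tmp permissions numerically
   chmod -t /tmp       # clear it again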

Is Stickiness set?

http://en.wikipedia.org/wiki/Sticky_bit

In Unix symbolic file system permissions notation, the sticky bit is represented by the letter t in the final character-place. For instance, in our Linux Environment , the /tmp directory, which by default has the sticky-bit set, shows up as:

  $ ls -ld /tmp
  drwxrwxrwt   4 root     sys          485 Nov 10 06:01 /tmp

Prerequisites – Operating System

Introduction

Listed below are Cloudera’s stated minimal requirements (in the areas of Operating System, Database, and JDK).

For the bare minimum install we are targeting, we do not need a database; it is only kept in for completeness.  And, even when needed, the database itself can be on another server outside of the Cloudera node or cluster.

Operating System

http://www.cloudera.com/content/support/en/documentation/cdh4-documentation/cdh4-documentation-v4-latest.html

  • Redhat – Redhat Enterprise Linux (v5.7 –> 64-bit, v6.2 –> 32 and 64 bit)
  • Redhat – CentOS (v5.7 –> 64-bit, v6.2 –> 32 and 64 bit)
  • Oracle Linux  (v5.6 –> 64-bit)
  • SUSE Linux Enterprise Server (SLES) (v11 with SP1 –> 64 bit)
  • Ubuntu / Debian (Ubuntu – Lucid 10.04 [LTS] –> 64 bit)
  • Ubuntu / Debian (Ubuntu – Precise 12.04 [LTS] –> 64 bit)
  • Ubuntu / Debian (Debian – Squeeze 6.03 –> 64 bit)

What does all this mean:

  • The only 32-bit OS supported is Redhat’s.  If RedHat Enterprise Linux or RedHat CentOS, then the minimum OS version is v6.2 for 32-bit and v5.7 for 64-bit

Databases

http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.2.0/CDH4-Requirements-and-Supported-Versions/cdhrsv_topic_2.html

  • Oozie (MySQL v5.5, PostgreSQL v8.4, Oracle 11gR2)
  • Hue (MySQL v5.5, PostgreSQL v8.4)
  • Hive (MySQL v5.5, PostgreSQL v8.3)

Java /JDK

http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.2.0/CDH4-Requirements-and-Supported-Versions/cdhrsv_topic_3.html

  • Jdk 1.6 –> 1.6.0_31
  • jdk 1.7 –> 1.7.0_15

Prerequisites – Networking – Name ID

Introduction

Ensure that your Network Names are unique and they are what you want them to be.

Validate Hostname

Use hostname.



    Syntax:

         hostname

    Sample:

         hostname

Output:

Network - Hostname - Get

Set Hostname

If the hostname is not what you thought it would be, please set it using resources available on the Net:

Prerequisites – Networking – Domain Name & FQDN

Introduction

Get Domain Name and FQDN (Fully qualified domain name)

Get Domain Name (using hostname)



    Syntax:

         hostname --domain

    Sample:

         hostname  --domain

 

Get Domain Name (using resolv.conf)



    Syntax:

         cat /etc/resolv.conf

    Sample:

         cat  /etc/resolv.conf

Output:

resolv.conf

Explanation:

  • In the file /etc/resolv.conf, your domain name is the entry prefixed by domain 

Get Hostname (FQDN)

Get FQDN (Fully qualified hostname).



    Syntax:

         hostname --fqdn

    Sample:

         hostname --fqdn

Output:

Network - Hostname - Get (FQDN)

Interpretation:

  • Pinged the DNS Server and discovered it is offline; it is a Windows machine and yesterday was Patch Tuesday.  And, unfortunately, this particular machine needs a key press to fully come back online… Never figured out what is up with the BIOS

Set Domain Name

Good Resources on the Net:

Prerequisites – Networking – Name Resolution

Introduction

As Hadoop is fundamentally a testament to Network Clustering and Collaborative Engineering, your working hosts have to have verifiably working TCP/IP name resolution.

Validate Hostname



    Syntax:

         ping <hostname> 

    Sample:

         ping rachel 

Network -- ping hostname -- rachel

Since we got an error message stating that “unknown host <hostname>”, we need to go to our DNS Server and make sure that we have “A” entries for them ….

Our DNS is a Windows DNS Server, and it was relatively easy to create an “A” record for it:

 

network-hostname-dns-rachel

Went back and checked to ensure that our DNS Resolution is good:

Network - hostname - dns (rachel) -- resolved
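For completeness, the re-check itself is just a name-resolution test; a sketch (ping ships with CentOS, and nslookup assumes the bind-utils package is present):

   ping -c 4 rachel
   nslookup rachel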

Prerequisites – wget

Introduction – wget

To download files over HTTP without a browser, just through the command shell, we chose to use wget.

Install – wget

sudo yum -y install wget

 

Prerequisites – lsof

Introduction – lsof

lsof is to Linux what SysInternals’ Process Monitor is to Windows.  It lets us track files and network ports being used by a process.

Install – lsof

sudo yum -y install lsof

Prerequisites – Java

Here are the steps for validating that we have the right Java JDK installed.

Java – Minimal Requirements

We need Java and we need one of the latest versions (JDK 1.6 or JDK 1.7).

http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-Requirements-and-Supported-Versions/cdhrsv_topic_3.html

  • For JDK 1.6, CDH4 is certified with 1.6.0_31
  • For JDK 1.7, CDH4.2 and later is certified with 1.7.0_15

Is Java installed?

Is Java installed on our box, and if so what version?

java -version

Output:

java - command not found

Get URL for Java (+JDK +JRE)

To get to the Java download, please visit:

http://www.oracle.com/technetwork/java/javase/downloads/index.html

Please note that you do not want just the JRE, but JDK (which has the JRE bundled with it).

Thus click on JDK.

As of today (2013-05-12), the latest available JDK is 7U21.

java-downloads-url

Further down on that same download page, we will notice that there is a separate download file for each OS and bitness.

java-downloads-osandbitness

As we have a 32-bit Linux that is able to use rpm, we want to capture the URL for  jdk-7u21-linux-i586.rpm.

That URL ends up being

http://download.oracle.com/otn-pub/java/jdk/7u21-b11/jdk-7u21-linux-i586.rpm

Download Oracle/Sun Java JDK

Googled for help and found:

How to automate download and installation of the Java JDK on Linux:

http://stackoverflow.com/questions/10268583/how-to-automate-download-and-instalation-of-java-jdk-on-linux

wget --no-cookies --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com%2Ftechnetwork%2Fjava%2Fjavase%2Fdownloads%2Fjdk6-downloads-1637591.html;"  "http://download.oracle.com/otn-pub/java/jdk/7u21-b11/jdk-7u21-linux-i586.rpm"

Error:



Resolving download.oracle.com... 96.17.108.106, 96.17.108.163
Connecting to download.oracle.com|96.17.108.106|:443... connected.
ERROR: certificate common name "a248.e.akamai.net" doesn't match requested host name "download.oracle.com".
To connect to download.oracle.com insecurely, use --no-check-certificate.

And, other series of problems, until I really took care to make the following changes:

  • Changed the URL from http: to https:
  • Added the option "--no-check-certificate"

And then ended up with a working syntax:


wget --no-cookies --no-check-certificate  --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com" "https://download.oracle.com/otn-pub/java/jdk/7u21-b11/jdk-7u21-linux-i586.rpm" -O "jdk-7u21-linux-i586.rpm"

Installed Oracle/Sun Java JDK

Syntax:

   yum install <jdk-rpm>

Sample:

   yum install jdk-7u21-linux-i586.rpm

Output:

java-jdk-installed

Install of Java JDK was successful!

Validated Install of Oracle/Sun Java JDK

Syntax:

   java -version

Sample:

   java -version

 Output:

java-jdk-version

Install Cloudera Bin Installer (./cloudera-manager-installer.bin)

Disclaimer

Please do not go down this road on a 32-bit system.

It will not work as cloudera-manager-installer.bin is a 64-bit software and will not run on a 32-bit.

This section is merely preserved for completeness; and as a place-holder.

 

Resource

As we are targeting v4.0x, we should direct our glance @ http://archive.cloudera.com/cm4/installer/

As of 2013-04-11, here is the folder view of what Cloudera has available:

cloudera-installers-folderList

We want the latest folder:

cloudera-installers-folderList (latest)

Download

Download using wget

The URL Link is http://archive.cloudera.com/cm4/installer/latest/cloudera-manager-installer.bin.

Here is the download specification:

  • Download URL: http://archive.cloudera.com/cm4/installer/latest/cloudera-manager-installer.bin
  • Output File: /tmp/cloudera-manager-installer.bin

wget http://archive.cloudera.com/cm4/installer/latest/cloudera-manager-installer.bin   -O /tmp/cloudera-manager-installer.bin

Validate FileInfo

Validate FileInfo (file -i <file>)


file -i cloudera-manager-installer.bin

Cloudera - Installer - file



cloudera-manager-installer.bin: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.9, not stripped

Explanation:

  • The file command reports a 64-bit (x86-64) ELF executable; as foreshadowed in the disclaimer, it cannot run on our 32-bit OS.

Prepare downloaded file

use chmod to make file executable.


Syntax:

   chmod u+x <file>

Sample:

  chmod u+x cloudera-manager-installer.bin

Run Installer

Run installer



sudo ./cloudera-manager-installer.bin

Unfortunately, got an error message:

 



./cloudera-manager-installer.bin: ./cloudera-manager-installer.bin: cannot execute binary file

Verified that we cannot install cloudera-manager-installer.bin on a 32-bit OS; the OS has to be a 64-bit OS.

Manage Yum Repository –  Cloudera

Background?

Once you find yourself using packages from a specific vendor, and correspondingly its repository, quite a bit, I would advise you to add that vendor to your repository configuration.

Basically, you want to be able to do the following:

  • Make your machine aware that it can safely access said repository for packages you request
  • Confirm that you trust the vendor’s GPG key 

Is Cloudera GPG key installed?

Repository definition files are saved in the /etc/yum.repos.d/ folder.

Check folder

ls /etc/yum.repos.d/

Output:

Folder List -- etc:yum.repos.d:

Trust Vendor – Cloudera

 

Trust Vendor (Cloudera) by trusting its GPG Key.



Syntax:

 sudo rpm --import <key>

Sample:

 sudo rpm --import \
 http://archive.cloudera.com/cdh4/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera

No feedback that the above succeeded.  But, I think we are good.
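If you would like positive confirmation, one hedged way is to list the public keys rpm now knows about and look for Cloudera in the summary column:

   rpm -q gpg-pubkey --qf '%{NAME}-%{VERSION}-%{RELEASE}\t%{SUMMARY}\n'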

Review Vendor Repo File

Check folder

ls /etc/yum.repos.d/

Output:

Folder List -- etc:yum.repos.d: [v2]

Obviously, we now have the cloudera-cdh4.repo file in the /etc/yum.repos.d folder

Review Contents of Vendor Repo File

Review Repo File Contents

cat /etc/yum.repos.d/cloudera-cdh4.repo

Output:

View - Vendor - Repo file

Explanation:

 

Decision Time

There are a few critical decisions you have to make:

  • What is your topology – A single system or a distributed system?
  • MapReduce or Yarn

Topology – Pseudo Distributed / Cluster

If you will be using a single node, then Cloudera terms this Pseudo Distributed.  On the other hand, if you will be using multiple nodes, Cloudera terms this a Cluster.

MapReduce (MRv1) or Yarn (MRv2)

What is the difference between MapReduce and Yarn?

http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.2.0/CDH4-Installation-Guide/cdh4ig_topic_11_4.html

MapReduce has undergone a complete overhaul and CDH4 now includes MapReduce 2.0 (MRv2). The fundamental idea of MRv2’s YARN architecture is to split up the two primary responsibilities of the JobTracker — resource management and job scheduling/monitoring — into separate daemons: a global ResourceManager (RM) and per-application ApplicationMasters (AM). With MRv2, the ResourceManager (RM) and per-node NodeManagers (NM), form the data-computation framework. The ResourceManager service effectively replaces the functions of the JobTracker, and NodeManagers run on slave nodes instead of TaskTracker daemons. The per-application ApplicationMaster is, in effect, a framework specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks. For details of the new architecture, see Apache Hadoop NextGen MapReduce (YARN).

Can we install both MapReduce (MRv1) and YARN (MapReduce v2)?

http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.2.0/CDH4-Quick-Start/cdh4qs_topic_3_1.html

For installations in pseudo-distributed mode, there are separate conf-pseudo packages for an installation that includes MRv1 (hadoop-0.20-conf-pseudo) or an installation that includes YARN (hadoop-conf-pseudo). Only one conf-pseudo package can be installed at a time: if you want to change from one to the other, you must uninstall the one currently installed.

Which of them shall I use?

http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.2.0/CDH4-Installation-Guide/cdh4ig_topic_4_1.html

Cloudera does not consider the current upstream MRv2 release stable yet, and it could potentially change in non-backwards-compatible ways. Cloudera recommends that you use MRv1 unless you have particular reasons for using MRv2, which should not be considered production-ready.

What is our decision?

  • We will go with Map Reduce [MRv1]

Installation File Matrix

To keep ourselves honest, let us prepare a quick checklist of RPMs.

Installation File Matrix

If we go with a Pseudo install, then please look for the RPMs that have Pseudo in their name.

Mode Component RPM
Pseudo Distributed Map Reduce v1 hadoop-0.20-conf-pseudo
Pseudo Distributed Map Reduce v2 (Yarn) hadoop-conf-pseudo

On the other hand, if you will like a Cluster Install, then please follow the instructions documented in http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.2.0/CDH4-Installation-Guide/cdh4ig_topic_4_4.html

Cluster installs have to be performed one component at a time and on each host, and are beyond the scope of our current posting.

Install Pseudo Install

Background

We have chosen the following install path:

  • Pseudo Mode
  • MapReduce v1

Let us go get and install hadoop-0.20-conf-pseudo

Review RPM Package (hadoop-0.20-conf-pseudo)

Before we install this package, let us quickly review and make sure that it is what we want:

  • Get general info
  • Get a quick dependency list

General Info

Before we even have the file, let us check the package’s info while the package is at rest (on the Vendor’s web site):



Syntax:

   yum info --nogpgcheck  <package-name>

Sample:

   yum info --nogpgcheck  hadoop-0.20-conf-pseudo

Output:

Hadoop - yum -info -- hadoop-conf-pseudo

Explanation:

  • Name :- hadoop-0.20-conf-pseudo
  • Architecture :- i386
  • Repo :- cloudera-cdh4
  • Summary :- Hadoop installation in pseudo-distribution mode with MRv1

Repoquery Info

You can use “repoquery --list” to check on your package, prior to downloading it.

Beforehand, make sure that you have installed the yum-utils package (“sudo yum install yum-utils”).

Run repoquery:



Syntax:

   repoquery --list <package-name>

Sample:

   repoquery --list hadoop-0.20-conf-pseudo

 

Dependency Info

Dependency info



Syntax:

   yum deplist --nogpgcheck  <package-name>

Sample:

   yum deplist --nogpgcheck  hadoop-0.20-conf-pseudo

Output:

Hadoop - yum -deplist -- hadoop-conf-pseudo

Explanation:

Here are the dependencies:

  • hadoop-0.20-mapreduce-tasktracker
  • hadoop-hdfs-datanode
  • hadoop-hdfs-namenode
  • hadoop-0.20-mapreduce-jobtracker
  • hadoop-hdfs-secondarynamenode
  • /bin/sh (bash)
  • hadoop (hadoop base)

Install Rpm (hadoop-0.20-conf-pseudo)

Install rpm



Syntax:

   sudo yum install <package-name>

Sample:

    sudo yum install hadoop-0.20-conf-pseudo

Output:

hadoop-conf-pseudo -- install (confirmation?)

We respond in the affirmative….

And, the installation completed:

hadoop-conf-pseudo -- install (afirmative)

Post Installation Review – File System

Background

In Linux, it is commonly said that “Everything is a file”.

And, so let us begin by reviewing the File System (FS).

Review our package files (rpm -ql)

Show files installed by our RPM:



Syntax:

   rpm -ql <package-name> 

Sample:

   rpm -ql hadoop-0.20-conf-pseudo 

Output:

 hadoop-conf-pseudo --ql

Explanation:

  • We have the pseudo MapReduce v1 configuration files (*.xml)
  • We have the base components folders (/var/lib/hadoop, /var/lib/hdfs)

Review our configuration files

Where are those configuration files ?

Glad you asked!



Syntax:

   ls -la <configuration>

Sample:

   ls -la /etc/hadoop/conf.pseudo.mr1

Output:

List Folder -- etc-hadoop-conf.pseudo.mr1

Support for various versions

Background

To maximize flexibility, CDH supports different installed versions.  But, keep in mind that only one version can be running at a time.

The Alternatives framework underpins this support.

Alternatives

Introduction

http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.2.0/CDH4-Quick-Start/cdh4qs_topic_3_2.html

The Cloudera packages use the alternatives framework for managing which Hadoop configuration is active. All Hadoop components search for the Hadoop configuration in /etc/hadoop/conf.

Review

Is alternatives actually in effect?

There are a couple of things we can check:

  • alternatives --display <name>
  • update-alternatives --display <name>

alternatives --display



Syntax:

   sudo alternatives --display <file-name>

Sample:

   sudo alternatives --display hadoop

update-alternatives --display



Syntax:

   sudo update-alternatives --display <file-name>

Sample:

   sudo update-alternatives --display hadoop

Conclusion

Alternatives does not appear to be in play…
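Another quick, hedged check is to look at where /etc/hadoop/conf itself points; on alternatives-managed installs it typically resolves through /etc/alternatives:

   ls -ld /etc/hadoop/conf
   readlink -f /etc/hadoop/conf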

Component Level – User & Group – Review

Background

Let us do a quick review of our users and groups.

Hadoop – Users

Get User file (/etc/passwd) entries for well known Hadoop user accounts:

Quick, Quick, what are the well known user accounts:

  • hdfs
  • mapred
  • zookeeper
Syntax:
  cat /etc/passwd | cut -d: -f1 | egrep "xxx|yyy|zzz"

Sample:

  cat /etc/passwd | cut -d: -f1 | egrep "hdfs|mapred|zookeeper"

 

Output:

hadoop-conf-pseudo - match users

What does our little code do:

  • The file is /etc/passwd
  • Use cut, passing in delimiter (:), to get the first field (the username) in /etc/passwd
  • Match on any of the supplied users

Hadoop – Groups

Browse the Group file (/etc/group) for well known Hadoop groups and user accounts:

What are the well known Groups and what is their membership:

  • hadoop
  • hdfs
  • mapred
  • zookeeper
Syntax:
  cat /etc/group | egrep -i "xxx|yyy|zzz"

Sample:

  cat /etc/group | egrep -i "hadoop|hdfs|mapred|zookeeper"

 

What does our little code do:

  • The file is /etc/group
  • Match on any of the supplied groups

hadoop-conf-pseudo - match users and groups

Explanation:

  • Obviously, we have a group named hadoop and its members are hdfs and mapred
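As an aside, the same user and group lookups can be done with getent, which is standard on CentOS; a sketch:

   getent passwd hdfs mapred zookeeper
   getent group hadoop hdfs mapred zookeeper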

Component Level – Review & Configuration – HDFS – NameNode – File System Format

 

Background

Here are a few things you should do to initialize HDFS Name Node.

HDFS – NameNode – Format

On the Hadoop HDFS Name Node, let us go ahead and format the NameNode:


sudo -u hdfs hadoop namenode -format

 

Explanation

  • The HDFS NameNode service runs under the hdfs account, and so to gain access to it, we sudo as that user name

Output (Screen Shot):

hadoop-conf-pseudo -- hadoop namenode -format

Explanation:

  • We are able to format our namenode
  • Our default replication is 1
  • The File System is owned by the hdfs user
  • And, the File System ownership group is supergroup
  • Permission is enabled
  • High Availability (HA) is not enabled
  • We are in Append Mode
  • Our storage directory is /var/lib/hadoop-hdfs/cache/hdfs/dfs/name  

 

Reformat?

If the NameNode File System is already formatted, and you issue an HDFS format request, you will be asked to confirm that you want to re-format.

Screen shot:

HDFS - Reformat?

Text Output:

 

Re-format filesystem in Storage Directory /var/lib/hadoop-hdfs/cache/hdfs/dfs/name ? (Y or N)

Component Level – Review & Configuration – HDFS – Name Node – Temp Folder

 

Background

Like any other file system, HDFS needs a temp folder.

HDFS – Create and Grant Permissions to the Temp Folder (/tmp)

Let us create the HDFS /tmp folder and grant permissions on it

Here are the particulars:

  • The HDFS folder name :- /tmp
  • The HDFS Permission :- 1777

Syntax:

  sudo -u hdfs hadoop fs -mkdir /tmp

  sudo -u hdfs hadoop fs -chmod -R 1777 /tmp

Sample:

  sudo -u hdfs hadoop fs -mkdir /tmp

  sudo -u hdfs hadoop fs -chmod -R 1777 /tmp

Explanation:

  • sudo as hdfs, issue fileSystem (fs) make-directory (mkdir) /tmp
  • Change permission to allow all to write, read, and execute

 

HDFS – Validate Folder (/tmp) Creation and Permission Set

Let us review HDFS:/tmp existence and permission set:

Introduction:

To gain access to HDFS, we do the following:

  • sudo as hdfs
  • We invoke hadoop fs
  • Our payload is -ls — List
  • Arguments : -d — Target directory and not files
  • And, we are targeting the /tmp folder
Syntax:

  sudo -u hdfs hadoop fs -ls -d /tmp

Sample:

  sudo -u hdfs hadoop fs -ls -d /tmp

Output:

Hadoop - hdfs - ls - :tmp

Explanation:

  • HDFS :/tmp folder exists
  • Owner (hdfs) can read/write/execute
  • Group (supergroup) can read/write/execute
  • Everyone can read/write/execute and the sticky bit is set (t last character in the file permissions column)

Component Level – Review & Configuration – MapReduce System Directories

Background

There are quite a few HDFS folders that MapReduce needs.

HDFS – MapReduce Folders

Let us create and grant the HDFS:{MapReduce} folders:

  • Create new HDFS Folder {/var/lib/hadoop-hdfs/cache/mapred/mapred/staging}
  • Set permissions of /var/lib/hadoop-hdfs/cache/mapred/mapred/staging to 1777 – World writable and sticky-bit
  • Change the owner of  /var/lib/hadoop-hdfs/cache/mapred/mapred and sub-directories to user mapred


Syntax:

  sudo -u hdfs hadoop fs -mkdir -p \
     /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
  sudo -u hdfs hadoop fs -chmod 1777 \
    /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
  sudo -u hdfs hadoop fs -chown -R mapred \
    /var/lib/hadoop-hdfs/cache/mapred

Sample:

  sudo -u hdfs hadoop fs -mkdir -p \
     /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
  sudo -u hdfs hadoop fs -chmod 1777 \
    /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
  sudo -u hdfs hadoop fs -chown -R mapred \
    /var/lib/hadoop-hdfs/cache/mapred

 

HDFS – Validate Folder {MapReduce} Creation and Permission Set

Let us review HDFS:/var/lib/hadoop-hdfs/cache/mapred existence and permission set:

Syntax:

  sudo -u hdfs hadoop fs -ls -R /var/lib/hadoop-hdfs/cache/mapred

Sample:

  sudo -u hdfs hadoop fs -ls -R /var/lib/hadoop-hdfs/cache/mapred

Output:

cloudera-cdh-4--hdfs--folder--ls

Explanation:

HDFS :/var/lib/hadoop-hdfs/cache/mapred/mapred folder

  • Owner (mapred) can read/write/execute
  • Group (supergroup) can read/execute (but not write)
  • Everyone can read/execute (but not write)

HDFS :/var/lib/hadoop-hdfs/cache/mapred/mapred/staging folder

  • Owner (mapred) can read/write/execute
  • Group (supergroup) can read/write/execute
  • Everyone can read/write/execute and the sticky bit is set (t last character in the file permissions column)

CDH Services

Prepare Inventory of CDH Services

Service List

Here is our expected Service List.

Component Service Name
HDFS – Name Node (Primary) hadoop-hdfs-namenode
HDFS – Name Name (Secondary) hadoop-hdfs-secondarynamenode
HDFS – Data Node hadoop-hdfs-datanode
Hadoop-MapReduce – Job Tracker hadoop-0.20-mapreduce-jobtracker
Hadoop-MapReduce – Task Tracker hadoop-0.20-mapreduce-tasktracker

Using chkconfig to list Hadoop Services


Syntax:

   # list all services
   sudo chkconfig --list 

   # list specific services, based on name
   sudo chkconfig --list | grep -i <service-name>

Sample:

   sudo chkconfig --list | grep -i "^hadoop"

Screen Shot:

hadoop-conf-pseudo -- Services

Explanation:

The services are auto-started starting from run-level 3.

Using /etc/init.d


Syntax:

   for service in /etc/init.d/<service-name>; do echo $service; done

Sample:

   for service in /etc/init.d/hadoop*; do echo $service; done

Screen Shot:

Service -- :etc:init.d

Starting CHD Services

http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.2.0/CDH4-Installation-Guide/cdh4ig_topic_27_1.html

Component Command Log
hadoop-hdfs-namenode sudo /sbin/service hadoop-hdfs-namenode start /var/log/hadoop-hdfs/hadoop-hdfs-namenode-<hostname>.log, /var/log/hadoop-hdfs/hadoop-hdfs-namenode-<hostname>.out
hadoop-hdfs-secondarynamenode sudo /sbin/service hadoop-hdfs-secondarynamenode start /var/log/hadoop-hdfs/hadoop-hdfs-secondarynamenode-<hostname>.out
hadoop-hdfs-datanode sudo /sbin/service hadoop-hdfs-datanode start /var/log/hadoop-hdfs/hadoop-hdfs-datanode-<hostname>.out
hadoop-0.20-mapreduce-jobtracker sudo /sbin/service hadoop-0.20-mapreduce-jobtracker start /var/log/hadoop-0.20-mapreduce/hadoop-hadoop-jobtracker-<hostname>.out
hadoop-0.20-mapreduce-tasktracker sudo /sbin/service hadoop-0.20-mapreduce-tasktracker start /var/log/hadoop-0.20-mapreduce/hadoop-hadoop-tasktracker-<hostname>.out

Start Services – Using /etc/init.d

Look for items in the /etc/init.d/ folders that have hadoop in their names and start them.


Syntax:

   for service in /etc/init.d/<service-name>; do sudo $service start; done

Sample:

  for service in /etc/init.d/hadoop-*; do sudo $service start; done
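The same loop pattern works for checking or stopping the services; a sketch, assuming each init script implements the usual status and stop verbs (which these do):

   for service in /etc/init.d/hadoop-*; do sudo $service status; done
   for service in /etc/init.d/hadoop-*; do sudo $service stop; done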

Errors:

Here are some errors we received, because I chose not to follow instructions or jumped over some steps.

One thing I have had to learn about Linux, and Enterprise Systems in general, is that “brevity in instructions is sacrosanct”: you should make sure that you follow everything, or Google for help and hope someone else made the same mistakes, got the same specific errors, and shared the resolution.

Errors – HDFS-NameNode

Here are HDFS Name Node errors.

The log file is

  • Syntax –> /var/log/hadoop-hdfs/hadoop-hdfs-namenode-<hostname>.log
  • Sample –>  /var/log/hadoop-hdfs/hadoop-hdfs-namenode-rachel.log
Error due to name resolution error

Specific Errors:

  • ERROR org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Error getting localhost name.
  • java.net.UnknownHostException: <hostname>: <hostname>
  • at java.net.InetAddress.getLocalHost(InetAddress.java:1466)

Screen Dump:



ERROR org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Error getting localhost name. Using 'localhost'...
java.net.UnknownHostException: rachel: rachel
at java.net.InetAddress.getLocalHost(InetAddress.java:1466)
at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.getHostname(MetricsSystemImpl.java:496)
at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.configureSystem(MetricsSystemImpl.java:435)
at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.configure(MetricsSystemImpl.java:431)
at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.start(MetricsSystemImpl.java:180)
at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.init(MetricsSystemImpl.java:156)
at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.init(DefaultMetricsSystem.java:54)
at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.initialize(DefaultMetricsSystem.java:50)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1140)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1205)
Caused by: java.net.UnknownHostException: rachel
at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:894)
at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1286)
at java.net.InetAddress.getLocalHost(InetAddress.java:1462)
... 9 more

Explanation

  • Add hostname to your DNS Server
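If you do not control a DNS server, the usual fallback is an /etc/hosts entry on the node itself; a sketch, where the IP address and domain are purely illustrative:

   echo "192.168.1.10   rachel.example.com   rachel" | sudo tee -a /etc/hosts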

Error due to HDFS being in an inconsistent state


FATAL org.apache.hadoop.hdfs.server.namenode.NameNode: Exception in 
namenode join

org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: 
Directory /var/lib/hadoop-hdfs/cache/hdfs/dfs/name is in an inconsistent state: 

storage directory does not exist or is not accessible.
	at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverStorageDirs(FSImage.java:296)
	at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:202)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:592)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:435)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:397)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:399)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:433)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:609)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:590)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1141)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1205)

INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1

INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG: 

Specific Errors:

  • org.apache.hadoop.hdfs.server.namenode.NameNode: Exception in namenode join: org.apache.hadoop.hdfs.server.common.InconsistentFSStateException
  • Directory /var/lib/hadoop-hdfs/cache/hdfs/dfs/name is in an inconsistent state: storage directory does not exist or is not accessible.

Screen Dump:

hadoop-cosf-pseudo -- hdfs -- inconsistent state

Explanation

  • Please go ahead and format the HDFS Name Node; this should be run on the primary NameNode:
    sudo -u hdfs hadoop namenode -format
    

     

Errors – HDFS-DataNode

Here are HDFS Data Node errors.

The log file is

  • Syntax –> /var/log/hadoop-hdfs/hadoop-hdfs-datanode-<hostname>.log
  • Sample –> /var/log/hadoop-hdfs/hadoop-hdfs-datanode-rachel.log
Error due to host name resolution error

Screen Shot:



[dadeniji@rachel conf]$ cat /var/log/hadoop-hdfs/hadoop-hdfs-datanode-rachel.log

2013-05-13 15:24:32,357 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: STARTUP_MSG:

/************************************************************
STARTUP_MSG: Starting DataNode

STARTUP_MSG:   host = java.net.UnknownHostException : rachel: rachel

STARTUP_MSG:   args = []

STARTUP_MSG:   version = 2.0.0-cdh4.2.1

STARTUP_MSG:   build = file:///data/1/jenkins/workspace/generic-package-centos32-6/topdir/BUILD/hadoop-2.0.0-cdh4.2.1/src/hadoop-common-project/hadoop-common -r 144bd548d481c2774fab2bec2ac2645d190f705b; compiled by 
'jenkins' on Mon Apr 22 10:26:05 PDT 2013

STARTUP_MSG:   java = 1.7.0_21
************************************************************/

2013-05-13 15:24:32,895 WARN org.apache.hadoop.hdfs.server.common.Util: 
Path /var/lib/hadoop-hdfs/cache/hdfs/dfs/data should be specified as a URI in configuration files. Please update hdfs configuration.

2013-05-13 15:24:33,962 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in secureMain

java.net.UnknownHostException: rachel: rachel
	at java.net.InetAddress.getLocalHost(InetAddress.java:1466)
	at org.apache.hadoop.security.SecurityUtil.getLocalHostName(SecurityUtil.java:223)
	at org.apache.hadoop.security.SecurityUtil.login(SecurityUtil.java:243)
	at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1694)
	at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1719)
	at org.apache.hadoop.hdfs.server.datanode.DataNode.secureMain(DataNode.java:1872)
	at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1893)
Caused by: java.net.UnknownHostException: rachel
	at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
	at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:894)
	at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1286)
	at java.net.InetAddress.getLocalHost(InetAddress.java:1462)
	... 6 more

2013-05-13 15:24:33,987 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1

2013-05-13 15:24:34,006 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG: 

/************************************************************
SHUTDOWN_MSG: Shutting down DataNode at java.net.UnknownHostException: 
rachel: rachel
************************************************************/

Explanation

  • Need to quickly go in and make sure we are able to resolve our host name; for this specific host, the host name is rachel
Error due to required service not running


WARN org.apache.hadoop.hdfs.server.common.Util: Path /var/lib/hadoop-hdfs/cache/hdfs/dfs/data should be specified as a URI in configuration files.
Please update hdfs configuration.

WARN org.apache.hadoop.metrics2.impl.MetricsConfig: Cannot locate configuration: 
tried hadoop-metrics2-datanode.properties,hadoop-metrics2.properties

INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).

INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: DataNode metrics 
system started

INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Configured hostname is rachel

INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Opened streaming 
server at /0.0.0.0:50010

INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Balancing 
bandwith is 1048576 bytes/s

INFO org.mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) 
via org.mortbay.log.Slf4jLog

INFO org.apache.hadoop.http.HttpServer: Added global filter 'safety'
 (class=org.apache.hadoop.http.HttpServer$QuotingInputFilter)

INFO org.apache.hadoop.http.HttpServer: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context datanode

INFO org.apache.hadoop.http.HttpServer: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context static

INFO org.apache.hadoop.http.HttpServer: Added filter static_user_filter
 (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) 
to context logs

INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Opened info server 
at 0.0.0.0:50075

INFO org.apache.hadoop.hdfs.server.datanode.DataNode: dfs.webhdfs.enabled = false

INFO org.apache.hadoop.http.HttpServer: Jetty bound to port 50075

INFO org.mortbay.log: jetty-6.1.26.cloudera.2

INFO org.mortbay.log: Started SelectChannelConnector@0.0.0.0:50075

INFO org.apache.hadoop.ipc.Server: Starting Socket Reader #1 for port 
50020

INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Opened IPC server at /0.0.0.0:50020

INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Refresh request 
received for nameservices: null

INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Starting 
BPOfferServices for nameservices: <default>

WARN org.apache.hadoop.hdfs.server.common.Util: Path /var/lib/hadoop-hdfs/cache/hdfs/dfs/data should be specified as a URI in configuration files. 
Please update hdfs configuration.

INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Block pool 
<registering> (storage id unknown) service to localhost/127.0.0.1:8020 
starting to offer service

INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting

INFO org.apache.hadoop.ipc.Server: IPC Server listener on 50020: starting

INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:8020. Already tried 0 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)

INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:8020. Already tried 1 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)

Explanation

  • It looks like we are breaking when trying to communicate with host localhost, port 8020
  • So what is supposed to be listening on port 8020?
  • Quick Google for “Hadoop” and port 8020 landed us @ http://blog.cloudera.com/blog/2009/08/hadoop-default-ports-quick-reference/ and the listening service is Hadoop NameNode
  • So let us go make sure that Hadoop\Name Node is running and listening on Port 8020
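A hedged way to check both at once is to see whether anything is listening on 8020 and to ask the NameNode init script for its status:

   sudo netstat -ltpn | grep ":8020"
   sudo service hadoop-hdfs-namenode status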

Errors – MapReduce – Job Tracker

The log file is

  • Syntax –> /var/log/hadoop-0.20-mapreduce/hadoop-hadoop-jobtracker-<hostname>.log
  • Sample –> /var/log/hadoop-0.20-mapreduce/hadoop-hadoop-jobtracker-rachel.log
Error due to MapReduce / File System Permission Error

Specific Errors:

  • INFO org.apache.hadoop.mapred.JobTracker: Creating the system directory
  • WARN org.apache.hadoop.mapred.JobTracker: Failed to operate on mapred.system.dir (hdfs://localhost:8020/var/lib/hadoop-hdfs/cache/mapred/mapred/system) because of permissions.
  • WARN org.apache.hadoop.mapred.JobTracker: This directory should be owned by the user ‘mapred (auth:SIMPLE)’
  • WARN org.apache.hadoop.mapred.JobTracker: Bailing out …
  • org.apache.hadoop.security.AccessControlException: Permission denied: user=mapred, access=WRITE, inode=”/”:hdfs:supergroup:drwxr-xr-x
  • Caused by: org.apache.hadoop.ipc.RemoteException (org.apache.hadoop.security.AccessControlException): Permission denied: user=mapred, access=WRITE, inode=”/”:hdfs:supergroup:drwxr-xr-x
  • FATAL org.apache.hadoop.mapred.JobTracker: org.apache.hadoop.security.AccessControlException: Permission denied: user=mapred, access=WRITE, inode=”/”:hdfs:supergroup:drwxr-xr-x

Screen Dump:



INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics 
with processName=JobTracker, sessionId=
INFO org.apache.hadoop.mapred.JobTracker: JobTracker up at: 8021

INFO org.apache.hadoop.mapred.JobTracker: JobTracker webserver: 50030

INFO org.apache.hadoop.mapred.JobTracker: Creating the system directory

WARN org.apache.hadoop.mapred.JobTracker: Failed to operate on mapred.system.dir (hdfs://localhost:8020/var/lib/hadoop-hdfs/cache/mapred/mapred/system) because of 
permissions

WARN org.apache.hadoop.mapred.JobTracker: This directory should be owned 
by the user 'mapred (auth:SIMPLE)'

WARN org.apache.hadoop.mapred.JobTracker: Bailing out ...
org.apache.hadoop.security.AccessControlException: Permission denied: user=mapred, access=WRITE, inode="/":hdfs:supergroup:drwxr-xr-x
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:205)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:186)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:135)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:4684)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAncestorAccess(FSNamesystem.java:4655)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(FSNamesystem.java:2996)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:2960)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:2938)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:648)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:417)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44096)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1695)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1691)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)

Explanation

  • Having HDFS File System permission problems
  • Are the folders not created or are they created and we are only having problems with the way they are privileged?
  • I remembered that there was extended coverage of HDFS Map Reduce folder permissions in the Cloudera Docs.  Let us go review and apply those permissions
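In our case, that meant (re)applying the mapred system-directory ownership shown earlier; the key command is repeated here as a sketch:

   sudo -u hdfs hadoop fs -chown -R mapred /var/lib/hadoop-hdfs/cache/mapred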

Configuring init to start core Hadoop Services

http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.2.0/CDH4-Installation-Guide/cdh4ig_topic_27_2.html

Stopping Hadoop Services

http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.2.0/CDH4-Installation-Guide/cdh4ig_topic_27_3.html

Post Installation Review

Services – Review

Commands – service –status-all



sudo service --status-all | egrep -i "jobtracker|tasktracker|Hadoop"

Output:

hdaoop-conf-pseudo - service --status-all

Commands – tcp/ip service (listening)


sudo lsof -Pnl +M -i4 -i6 | grep LISTEN

Tried running lsof, but got an error message (command not found); as a stop-gap, used ps instead:


ps  -aux 2> /dev/null | grep "java"

Screen Dump (lsof: command not found):

lsof -- command not found

Once installed lsof (via instructions previously given)

Output:

hadoop-conf-pseudo -- services - listening

Explanation:

Explanation – Java

  • We have quite a few listening Java proceses
  • The java processes are listening on TCP/IP ports between 50010 and 50090; specifically 50010, 50020, 50030, 50060, 50070, 50075
  • And, also ports 8010 and 8020

Explanation – Auxiliary Services

  • sshd (port 22)
  • cupsd (port 631)

Commands – ps (running java applications)


ps -eo pri,pid,user,args | grep -i "java" | grep -v "grep" | awk '{printf "%-10s %-10s %-10s %-120s \n ", $1, $2, $3,  $4}'

Output:

ps -java programs (v2)

Interpretation:

  • With each Java app one will see -Dproc_secondarynamenode, -Dproc_namenode, or -Dproc_jobtracker –> this indicator obviously maps to a specific Hadoop Service

Operational Errors

Operational Errors – HDFS – Name Node

Operational Errors – HDFS – Name Node – Security – Permission Denied



mkdir: Permission denied: user=dadeniji, access=WRITE, inode="/user/dadeniji":hdfs:supergroup:drwxr-xr-x

Validate:

Check the permissions for HDFS under /user folder:


sudo -u hdfs hadoop fs -ls /user

We received:

hdfs -- Hadoop -- fs -ls

Explanation:

  • My folder, /user/dadeniji, is still owned by hdfs.

Let us go change it:


sudo -u hdfs hadoop fs -chown $USER /user/$USER

Validate Fix:


hadoop fs -ls /user/$USER

Output:

hdfs -- Hadoop -- fs -ls (fixed) [v2]
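
As an aside, if the /user/<login> folder had not existed at all, a minimal sketch (still assuming the hdfs superuser account) is to create it first and then hand over ownership:

# Sketch: create the home folder if it is missing, then chown it to the login user
sudo -u hdfs hadoop fs -mkdir /user/$USER
sudo -u hdfs hadoop fs -chown $USER /user/$USER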

Operational Errors – HDFS – DataNode

13/05/16 15:59:08 ERROR security.UserGroupInformation: PriviledgedActionException as:dadeniji (auth:SIMPLE) cause
:org.apache.hadoop.security.AccessControlException: Permission denied: 
user=dadeniji, access=EXECUTE, 
inode="/var/lib/hadoop-hdfs/cache/mapred/mapred/staging/dadeniji":
mapred:supergroup:drwx------
	at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:205)
	at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:161)
	at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:128)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:4684)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkTraverse(FSNamesystem.java:4660)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:2911)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:673)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:643)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44128)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1695)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1691)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1689)

When we issued:



sudo -u hdfs hadoop fs -ls  \
   /var/lib/hadoop-hdfs/cache/mapred/mapred/staging

We received:

fs -- check staging--dadeniji (v2)

Explanation:

  • For my personalized HDFS staging folder (/var/lib/hadoop-hdfs/cache/mapred/mapred/staging/dadeniji), the permission set is drwx------.
  • In other words, the owner (mapred) is the only account that has any permissions on it.
  • The Cloudera docs are quite prophetic about this type of error; their prescription is sketched after the excerpt below:

    Installing CDH4 in Pseudo-Distributed Mode
    Starting Hadoop and Verifying it is Working Properly:
    Create mapred system directories
    http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.2.0/CDH4-Quick-Start/cdh4qs_topic_3_2.html
    If you do not create /tmp properly, with the right permissions as shown below, you may have problems with CDH components later. Specifically, if you don’t create /tmp yourself, another process may create it automatically with restrictive permissions that will prevent your other applications from using it.
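
For reference, here is a sketch of the docs' prescription as I recall it; please double-check the exact commands against the linked quick-start page before running them.

# Sketch (from memory of the CDH4 quick-start page linked above -- verify before use)
sudo -u hdfs hadoop fs -mkdir /tmp
sudo -u hdfs hadoop fs -chmod -R 1777 /tmp

# and, for the MapReduce staging area, sticky-bit world-writable, owned by mapred
sudo -u hdfs hadoop fs -mkdir -p /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
sudo -u hdfs hadoop fs -chmod 1777 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
sudo -u hdfs hadoop fs -chown -R mapred /var/lib/hadoop-hdfs/cache/mapred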

Let us go correct it:

As it is merely a staging folder, let us remove it and hope that the system re-creates it:



sudo -u hdfs hadoop fs -rm -r \
   /var/lib/hadoop-hdfs/cache/mapred/mapred/staging/dadeniji

Once corrected we can run the MapReduce jobs.
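
A quick way to verify is to re-list the staging area and then smoke-test MapReduce with the stock example jar.  The jar path below is an assumption; adjust it to wherever your distribution keeps hadoop-examples.jar.

# Sketch: confirm the staging area looks sane again
sudo -u hdfs hadoop fs -ls /var/lib/hadoop-hdfs/cache/mapred/mapred/staging

# and smoke-test MapReduce (example-jar path is assumed -- adjust to your install)
hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar pi 2 100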


Hadoop – Hive – What is the Version # of Hive Service and Clients that you are running?

Introduction

Hadoop is a speeding bullet.  You look online, Google for things, try it out, and sometimes you hit, but often you miss.

What do I mean by that?

Well this evening I was trying to play with Hive; specifically using Sqoop to import a table from MS SQL Server into Hive.

A bit of background: my MS SQL Server table has a couple of columns declared as datetime.

Upon running the Sqoop statement pasted below:



--connect "jdbc:sqlserver://sqlServerLab;database=DEMO" \
--username "dadeniji" \
--password "l1c0na" \
--driver "com.microsoft.sqlserver.jdbc.SQLServerDriver" \
-m 1 \
--hive-import \
--hive-table "customer" \
--table "dbo.customer" \
--split-by "customerID"

 

The above command basically gives the following instruction set:

  • Via JDBC Driver (jdbc:sqlserver) connect to SQL Instance (sqlServerLab) and database Demo
  • Use the following SQL Server credentials — username – dadeniji, password – l1c0na
  • JDBC Driver’s Class name – com.microsoft.sqlserver.jdbc.SQLServerDriver
  • Number of parallel map tasks (-m 1)
  • Sqoop Operation — hive-import
  • Hive Table — customer
  • SQL Server Table — dbo.customer
  • Split-by — customerID

I noticed a couple of warnings in the Sqoop console log output:



INFO manager.SqlManager: Executing SQL statement: 
SELECT t.* FROM dbo.customer AS t WHERE 1=0

WARN hive.TableDefWriter: Column InsertTime had to be cast to a less 
precise type in Hive

WARN hive.TableDefWriter: Column salesDate had to be cast to a less 
precise type in Hive

Processing

Explore MS SQL Server

So I quickly went back and looked at my SQL Server table:

  use [Demo];
  exec sp_help 'dbo.customer';

Output:

Hadoop - Sqoop - MS SQL Server - dbo.customer

The output is congruent with my thoughts:

  • The InsertTime is a datetime column
  • The salesDate is a datetime column

Explore Hive

Launch Hive:

In shell, issue “hive” to initiate Hive Shell:


hive

List all tables:

To confirm that a corresponding table has been created in Hive, use show tables:


show tables;

Output:

Hadoop - Sqoop - Client - Show tables

Display Table Structure (customer):

Display table structure using describe:


Syntax:
    describe <table-name>;

Sample:

    describe customer;

Output:

Hadoop - Sqoop - Client - Describe -- customer

Explanation:

  • So it is obvious that our two original MS SQL Server datetime columns (InsertTime and salesDate) were not brought in as timestamp, but as string (a quick filtered check is sketched below)
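
As a quick filtered check, and sticking with the hive -e style used earlier in this post, we can describe the table and keep just the two columns of interest:

# Sketch: confirm just the two columns of interest came across as string
hive -e "describe customer;" | egrep -i "inserttime|salesdate"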

So I am thinking why?

Hive Datatype Support

I know that the timestamp type was not one of the original datatypes supported by Hive.  It was added in Hive version 0.8.0 (a quick probe against the running Hive is sketched below).

This is noted in:

HortonWorks – Hive – Language Manual – Datatypes
http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.0.2/ds_Hive/language_manual/datatypes.html
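
A quick way to test whether the Hive we are talking to accepts the datatype at all is to probe it with a throwaway table.  This is only a sketch, and the table name ts_probe is made up for the test:

# Sketch: probe the running Hive for timestamp support with a throwaway table
hive -e "create table ts_probe (t timestamp); describe ts_probe; drop table ts_probe;"

If the create succeeds, the timestamp datatype is supported; on an older Hive it should fail at the create statement.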

Determine Hive Version

There are a couple of ways to get the Hive server and client version numbers.

Determine Hive Version – Command Shell – Using ps

Issue ps -aux:



ps -aux | grep -i "Hive"

Output (Screen shot):

Hadoop - Hive - Version -- ps --aux

Output (Text):



hive     13767  0.0  1.9 841080 159768 ?       Sl   Apr15  17:00 /usr/java/default/bin/java -Xmx256m -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hadoop/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hadoop -Dhadoop.id.str= -Dhadoop.root.logger=INFO,console -Djava.library.path=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hadoop/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Xmx268435456 -Djava.net.preferIPv4Stack=true -Xmx268435456 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:-CMSConcurrentMTEnabled -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.util.RunJar /opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hive/lib/hive-service-0.10.0-cdh4.2.0.jar org.apache.hadoop.hive.metastore.HiveMetaStore -p 9083

410      13853  0.0  0.2 2207844 22824 ?       Ss   Apr15   0:00 postgres: hive hive 10.0.4.1(56963) idle            

410      13854  0.0  0.1 2206552 8388 ?        Ss   Apr15   0:00 postgres: hive hive 10.0.4.1(56964) idle            

dadeniji 18749  0.0  1.8 814332 152732 pts/0   Sl+  May10   0:21 /usr/java/default/bin/java -Xmx256m -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hadoop/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hadoop -Dhadoop.id.str= -Dhadoop.root.logger=INFO,console -Djava.library.path=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hadoop/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Xmx268435456 -Xmx268435456 -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.util.RunJar /opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/bin/../lib/hive/lib/hive-cli-0.10.0-cdh4.2.0.jar org.apache.hadoop.hive.cli.CliDriver

  • We have 4 processes bearing the “hive” name

Service Process

  • It is identifiable as a Hive Service via its name hive-service*.jar
  • It is running under the “hive” account name.  Its Process ID is 13767.  One of the Jar files referenced is /opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hive/lib/hive-service-0.10.0-cdh4.2.0.jar 
  • The Cloudera Version# is 4.2 and Hive Version# is 0.10

Client Process

  • It is identifiable as a Hive Client via its name hive-cli*.jar
  • It is running under my username (dadeniji), as I kicked it off.  Its Process ID is 18749.  One of the Jar files referenced is /opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/bin/../lib/hive/lib/hive-cli-0.10.0-cdh4.2.0.jar
  • The Cloudera Version# is 4.2 and the Hive Version# is 0.10 (a one-liner that distills this straight from ps is sketched after the process notes below)

“Postgres” Processes

  • Hive uses an embedded Postgres database (for its metastore)
  • These processes are running under account 410
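
Rather than eyeballing the full ps output above, a one-liner sketch can distill the version straight out of the service and client jar names:

# Sketch: pull the hive-service / hive-cli jar names (and hence the version) out of ps
ps -eo args | grep "[h]ive-" | grep -o "hive-\(service\|cli\)-[0-9][^ ]*\.jar" | sort -u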

Determine Hive Version – Cloudera Manager Admin Console & Command Shell

  • Launch Web Browser
  • Connect to Admin console ( http://<clouderaManagerServices>:<port>).  In our case http://hadoopCMS:7180; as Cloudera Manager Service is running on a machine named hadoopCMS and we kept the default port# of 7180
  • The initial screen displayed is the Service Status page (/cmf/services/status)
  • Click on the service we are interested in (hive1)
  • The service’s specific “Status and Health Summary” screen is displayed.  In this case “Hive1 – Services and Health Summary” page
  • In the row labelled “Hive MetaStore Server”, click on the link underneath the “Status” column
  • This will bring you to the “hivemetastore” summary page.
  • For each Hive host, Hive process information and links to the Hive logs are displayed
  • On the “Show Recent Logs” row, click on the “Full Stdout” log
  • The stdout.log appears.  Here is a breakdown of what it provides

stdout.log


Mon Apr 15 21:06:24 UTC 2013
using /usr/java/default as JAVA_HOME

using 4 as CDH_VERSION

using /opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hive 
    as HIVE_HOME

using /var/run/cloudera-scm-agent/process/22-hive-HIVEMETASTORE 
    as HIVE_CONF_DIR

using /opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hadoop 
    as HADOOP_HOME

using /var/run/cloudera-scm-agent/process/22-hive-HIVEMETASTORE/hadoop-conf as HADOOP_CONF_DIR

Starting Hive Metastore Server

Java version

We quickly see that JAVA_HOME is defined as /usr/java/default.

To see what files constitute /usr/java/default

  ls /usr/java/default

Output:

Hadoop - Cloudera Manager - Java Version

Explanation:

  • /usr/java/default is symbolically linked to /usr/java/latest
  • /usr/java/latest is symbolically linked to /usr/java/jdk1.7.0_17 (a quick shell check is sketched below)
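
To double-check this from the shell, a small sketch resolves the symlink chain and asks the JDK for its own version:

# Sketch: resolve the symlink chain and confirm the JDK version it points at
readlink -f /usr/java/default
/usr/java/default/bin/java -version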

Cloudera Distribution version

Based on the screen shot below, the CDH Version is 4

using 4 as CDH_VERSION

Hive Home

Based on the screen shot below, the Hive Home is /opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hive

using /opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hive as HIVE_HOME

Again, let us return to the command shell and see what files are in /opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hive

Append the /lib suffix to get to the jar files, and list only the jar files that have hive in their names.



ls /opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hive/lib/hive*.jar

Output:

Hadoop - Cloudera Manager - Server - ls Hive Jar files

Cloudera Manager Admin Console – Service Status

Hadoop - Cloudera Manager - Services - Status

Cloudera Manager Admin Console – Hive1 – Status and Health Summary

Hadoop - Cloudera Manager - Services - Status and Health Summary

Cloudera Manager Admin Console – Hive1 – Status Summary

Hadoop - Cloudera Manager - Hive - Status

Cloudera Manager Admin Console – Hive1 – Status Summary – Log – Stdout.log 

Hadoop - Cloudera Manager - Hive - Status - Log - stdout

Conclusion

It thus appears that, although we are running a version of Hive (0.10) that, per the datatype documentation above, should already support the timestamp datatype, the datetime columns were still imported as string.

The problem is therefore more likely with the version of Sqoop we have running, or with Sqoop’s ability to detect and map SQL Server’s datetime datatype (or datetime data representation in general).
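
If the Hive side does support timestamp, one possible workaround (an assumption on my part; confirm that your Sqoop release supports --map-column-hive) is to override the column mapping explicitly:

# Sketch: force the two datetime columns to Hive timestamp
# (verify that --map-column-hive exists in your Sqoop release before relying on it)
sqoop import \
   --connect "jdbc:sqlserver://sqlServerLab;database=DEMO" \
   --username "dadeniji" \
   --password "l1c0na" \
   --driver "com.microsoft.sqlserver.jdbc.SQLServerDriver" \
   -m 1 \
   --hive-import \
   --hive-table "customer" \
   --table "dbo.customer" \
   --split-by "customerID" \
   --map-column-hive "InsertTime=TIMESTAMP,salesDate=TIMESTAMP"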