Introduction
As a data-warehouse project, Hive does not support the traditional singleton database insert record statements.
Data inserts has to be through bulk-record modes. One of those avenues is through a built-in load statement.
Data Model
Data Model – Logical
Looked online for sample data models that are representative of familiar data entities. And, the one I finally settled on is a book entity.
It happens to be an XML representation:
http://msdn.microsoft.com/en-us/library/windows/desktop/ms762271(v=vs.85).aspx
I think we all remember when XML was going to take over the world as it crosses OS and cultural boundaries dues to its preciseness.
I am not sure how all that argument is now being jettisoned for no-sql.
Nevertheless, here is the Microsoft definition of a book entity:
|
Data Model – Physical
Here is Hive’s implementation of our logical data model.
|
Hive – Create Table
Prepare – Create Table Statement
It is a bit easier to just use an editor to enter your ddl statements.
We used vi to create /tmp/hive/helloWorld/dbBook__CreateTable__delimited.sql.
drop table bookCharacterDelimited; create table bookCharacterDelimited ( author string , title string , edition string , genre string , price decimal , publishDate timestamp , description string , haveit binary , acquireDate timestamp ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '~' STORED AS TEXTFILE ; quit;
Actual – Execute Create Table Statement
Once we have a file that contains our create statement, we invoke hive client and indicate that we will like to submit a sql file via -i.
hive -i /tmp/hive/helloWorld/dbBook__CreateTable__Delimited.sql
Hive – Table – Data
Hive – Table – Data
Create a text file. The name of the text file is /tmp/book.txt
Mohammed Khaled~Valley of the doors~1~Fiction1~10.45~2013-01-01~Afghan~0 Jackie Collins~Accident~2~12.56~TECH~27.87~2012-01-17~Lovely~1 Uwem Akpan~Say you're one of them~Fiction Elizabeth Berg~Open House~Fiction Kaye Gibbons~A Virtuos Woman Khaled Hosseini~And the mountains echoed Valorie Schaefer~The Care and Keeping of you~1 Malcom Gladwell~Outliers Carmen Reinhart & Kenneth Rogoff~This time is different: Eight Centuries of financial folly Steven Pinker~The Better Angels of our Nature: Why Violence has declined Vivien Stewart~A World-Class Education
Hive – Table – Data – Load
Invoke Hive Client (hive) to load the data:
hive -e "load data local inpath '/tmp/book.txt' into table bookCharacterDelimited;"
Hive – Table – View Data
Hive – Table – View Data
Use hive query tool to view the data:
hive -e "select * from bookCharacterDelimited;"
Hive – Create Table – Problem Areas
Introduction
Let us try using a different delimiter (;)
Prepare – Create Table Statement
Create Statements placed in /tmp/hive/helloWorld/dbBook__CreateTable__Delimited__SemiColon.sql
drop table bookCharacterDelimited__SemiColon; create table bookCharacterDelimited__SemiColon ( author string , title string , edition string , genre string , price decimal , publishDate timestamp , description string , haveit binary , acquireDate timestamp ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ';' STORED AS TEXTFILE ; quit;
Actual – Execute Create Table Statement
Once we have a file that contains our create statement, we invoke hive client and indicate that we will like to submit a sql file via -i.
hive -i /tmp/hive/helloWorld/dbBook__CreateTable__Delimited__SemiColon.sql
Output (Textual):
- FAILED: ParseException line 19:22 mismatched input ‘<EOF>’ expecting StringLiteral near ‘BY’ in table row format’s field separator
Output (Pictorial):
Explanation:
- It seems that the Hive application is quite picky about delimiters that are allowed
- If delimiters are not specified, the default choice is ctrl-A
Hive – Create Table – Native
- We created a new file /tmp/hive/helloWorld/dbBook__CreateTable__delimited.sql
- A delimiter is not specified and so we assume the default of CTRL-A
drop table booknative; create table booknative ( author string , title string , edition string , genre string , price decimal , publishDate timestamp , description string , haveit binary , acquireDate timestamp ) ROW FORMAT DELIMITED STORED AS TEXTFILE ; quit;
Replace original tilde (~) with CTRL-A
using sed or tr replace the original (~) with Ctrl-A (Hex 01)
In this case we chose to use tr:
cat dbBook_Data_in.txt | tr '~' $'\x01' > dbBook_Data_in_native.txt
Explanation:
- The original file name is dbBook_data_in.txt
- replace ~ with \x01
- The resultant file name is dbBook_Data_in_native.txt
Load data into Hive Table
load data / overwrite contents
hive -e "load data local inpath '/tmp/hive/helloWorld/dbBook_Data_in_native.txt' overwrite into table bookNative;'
Validate data
validate data:
hive -e "select * from bookCharacterDelimited;"
References
References – Hive – Data Definition Language
- Language Manualhttps://cwiki.apache.org/Hive/languagemanual-cli.html
- Data Definition Languagehttps://cwiki.apache.org/Hive/languagemanual-ddl.html#LanguageManualDDL-DropTable
References – Hive – Data Types
- Hive – Language – Manual Typeshttps://cwiki.apache.org/Hive/languagemanual-types.html
References – Control Characters
- Using vi as a hex editorhttp://www.linuxinstead.com/blog/2009/04/21/using-vi-as-a-hex-editor/
- Processing Control Charactershttp://unix.stackexchange.com/questions/5851/processing-control-characters
- Echo tr 012 and 001http://www.computing.net/answers/unix/aecho-tr-012-001-/7647.html