
Wednesday, March 23, 2016

What is Hive - Step by Step Part 1



Apache Hive is a component of the Hadoop ecosystem, built on top of HDFS, that provides a data warehouse solution for Hadoop.

It uses an SQL-like language called HiveQL and is meant for STRUCTURED DATA only. Hive is open source, and it turns HiveQL queries into MapReduce jobs that run on the Hadoop cluster. It was originally developed by Facebook.

Why Hive?
o   More productive than writing MapReduce directly
§  Five lines of HiveQL might be equivalent to hundreds of lines of Java code.
o   Brings large-scale data analysis to a broader audience
§  Leverage existing knowledge of SQL
o   Offers interoperability with other systems
§  Extensible through Java and external scripts
§  Many BI tools support Hive
              
Hive relies on a metastore.
The metastore is Hive's internal data store, i.e. all the table metadata is stored in the metastore. This includes:
i)                 table name
ii)                schema definition
iii)               column info
iv)               partition key if any
              
NOTE: By default, the Hive metastore is kept in an embedded Derby database.

How to configure metastore in Hive?

Modify the file "hive/conf/hive-site.xml" and set:
i)                 Connection URL details
ii)                Driver class name details
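As a rough illustration of the two settings above, the following Python sketch generates a minimal hive-site.xml fragment. The property names javax.jdo.option.ConnectionURL and javax.jdo.option.ConnectionDriverName are the standard Hive metastore keys; the helper name, the MySQL host and the database name are placeholders assumed for the example, not values from this post.

```python
import xml.etree.ElementTree as ET


def metastore_site_xml(connection_url, driver_class):
    """Build a hive-site.xml document holding the two metastore settings."""
    configuration = ET.Element("configuration")
    settings = {
        "javax.jdo.option.ConnectionURL": connection_url,
        "javax.jdo.option.ConnectionDriverName": driver_class,
    }
    for name, value in settings.items():
        prop = ET.SubElement(configuration, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(configuration, encoding="unicode")


# Hypothetical MySQL-backed metastore: host and database name are placeholders.
xml_text = metastore_site_xml(
    "jdbc:mysql://metastore-host:3306/hive_metastore",
    "com.mysql.jdbc.Driver",
)
print(xml_text)
```

In a real setup the user name and password properties would normally be set as well, and the matching JDBC driver jar has to be available to Hive.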


                             
HiveQL Data Types:-
TinyInt, SmallInt, Int, BigInt, Float, Double, String

Collection types:-
Map, Array, Struct
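To show how these types look in a table definition, here is a small Python sketch that assembles a HiveQL CREATE TABLE statement mixing primitive and collection types. The table name "employees" and all column names are made up for illustration.

```python
# Hypothetical table: the column names are invented for this example.
COLUMNS = [
    ("id", "BIGINT"),
    ("name", "STRING"),
    ("salary", "DOUBLE"),
    ("skills", "ARRAY<STRING>"),
    ("phone_numbers", "MAP<STRING,STRING>"),
    ("address", "STRUCT<city:STRING,zip:STRING>"),
]


def column_list(columns):
    """Render (name, type) pairs as the body of a CREATE TABLE statement."""
    return ",\n".join(f"  {name} {hive_type}" for name, hive_type in columns)


ddl = "CREATE TABLE employees (\n" + column_list(COLUMNS) + "\n)"
print(ddl)
```

The printed statement can be pasted into the Hive shell; a real table would usually also declare a row format and field delimiters.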

NOTE: Every table in Hive is created as a directory in HDFS

Differences between SQL and HiveQL



How Does Hive Load and Store Data?

-        Hive’s queries operate on tables, just like RDBMS
o   A table is simply an HDFS directory containing one or more files
o   Default path: /user/hive/warehouse/<table_name>
o   Hive supports many formats for data storage and retrieval

-        How does Hive know the structure and location of tables?
o   These are specified when tables are created
o   This metadata is stored in Hive’s metastore
§  Contained in an RDBMS such as MySQL

-        Hive consults the metastore to determine data format and location
o   The query itself operates on data stored on a filesystem (typically HDFS)
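The default path rule above can be sketched in Python as follows. The function name is hypothetical, and the "<database>.db" subdirectory for non-default databases is standard Hive behaviour that is added here as an assumption, since the post only mentions the default path.

```python
import posixpath

DEFAULT_WAREHOUSE = "/user/hive/warehouse"


def default_table_dir(table_name, database="default", warehouse=DEFAULT_WAREHOUSE):
    """Return the HDFS directory Hive would use by default for a table."""
    table_name = table_name.lower()  # Hive stores table names in lower case
    if database.lower() == "default":
        return posixpath.join(warehouse, table_name)
    # Tables in other databases live under <warehouse>/<database>.db/
    return posixpath.join(warehouse, database.lower() + ".db", table_name)


print(default_table_dir("customers"))                # /user/hive/warehouse/customers
print(default_table_dir("orders", database="sales"))  # /user/hive/warehouse/sales.db/orders
```

The warehouse root itself is configurable, so the constant here only reflects the default location.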


Hive Tables:
               1) Managed Tables (Internal Tables)
               2) External Tables
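As a quick preview of how the two kinds differ in HiveQL, this Python sketch builds the CREATE statement for each. The helper name, table name and HDFS location are assumptions chosen for the example. A managed table lives in Hive's warehouse directory, while an external table points at a location you supply.

```python
def create_table_ddl(name, columns, external_location=None):
    """Build a CREATE TABLE statement for a managed or an external table."""
    cols = ", ".join(f"{col} {hive_type}" for col, hive_type in columns)
    if external_location is None:
        # Managed (internal) table: Hive owns the data under its warehouse dir.
        return f"CREATE TABLE {name} ({cols})"
    # External table: Hive only records the metadata for data stored elsewhere.
    return f"CREATE EXTERNAL TABLE {name} ({cols}) LOCATION '{external_location}'"


columns = [("line", "STRING")]
print(create_table_ddl("logs", columns))
print(create_table_ddl("logs", columns, external_location="/data/logs"))
```

Dropping a managed table removes its data as well, whereas dropping an external table removes only the metadata and leaves the files in place.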


Part 2: Click Here
                                                                                                                  @SsaiK
