Wednesday, March 23, 2016

What is Hive? - Step by Step, Part 1



Apache Hive is a Hadoop component built on top of HDFS; it is Hadoop's data warehouse solution.

It uses an SQL-like language called HiveQL (open source), works with STRUCTURED DATA only, and generates MapReduce jobs that run on the Hadoop cluster. Hive was originally developed by Facebook.

Why Hive?
- More productive than writing MapReduce directly
  - Five lines of HiveQL might be equivalent to hundreds of lines of Java code (see the short example after this list)
- Brings large-scale data analysis to a broader audience
  - Leverages existing knowledge of SQL
- Offers interoperability with other systems
  - Extensible through Java and external scripts
  - Many BI tools support Hive
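
For instance, a grouped aggregation that would take a sizeable Java MapReduce program is just a few lines of HiveQL (the table and column names below are made up for illustration):

SELECT dept, COUNT(*) AS emp_count
FROM employees
WHERE esal > 50000
GROUP BY dept
ORDER BY emp_count DESC;

Hive compiles this into one or more MapReduce jobs behind the scenes.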
              
Hive is associated with a metastore.
The metastore is Hive's internal data store: all tabular metadata is kept there, including
i)   table names
ii)  schema definitions
iii) column info
iv)  partition keys, if any
              
NOTE: Hive's default metastore database is Derby.

How to configure the metastore in Hive?

Modify the file "hive/conf/hive-site.xml" with
i)  the connection URL details
ii) the driver class name details
(a sketch follows).
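
A minimal sketch of hive-site.xml for a MySQL-backed metastore (the host, database name and credentials here are only placeholders):

<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost/metastore?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hiveuser</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hivepassword</value>
  </property>
</configuration>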


                             
HiveQL datatypes:
TinyInt, SmallInt, Int, BigInt, Float, Double, String

Collection types:
map, array, struct

NOTE: Every table in Hive is created as a directory in HDFS.
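
For example, a table definition using the scalar and collection types above (table and column names are illustrative):

CREATE TABLE employee (
  empid    INT,
  ename    STRING,
  esal     DOUBLE,
  skills   ARRAY<STRING>,
  contact  MAP<STRING, STRING>,
  address  STRUCT<city:STRING, zip:INT>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

With the default warehouse location, creating this table creates the HDFS directory /user/hive/warehouse/employee.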

Differences between SQL and HiveQL



How does Hive load and store data?

- Hive's queries operate on tables, just like an RDBMS
  - A table is simply an HDFS directory containing one or more files
  - Default path: /user/hive/warehouse/<table_name>
  - Hive supports many formats for data storage and retrieval (see the example after this list)

- How does Hive know the structure and location of tables?
  - These are specified when tables are created
  - This metadata is stored in Hive's metastore
    - Contained in an RDBMS such as MySQL

- Hive consults the metastore to determine data format and location
  - The query itself operates on data stored on a filesystem (typically HDFS)
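
A small sketch of this flow, assuming the employee table above (the input path is illustrative):

LOAD DATA INPATH '/data/employee.csv' INTO TABLE employee;

LOAD DATA simply moves the file into the table's warehouse directory (e.g. /user/hive/warehouse/employee); queries then read it back using the format recorded in the metastore.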


Hive Tables (a sketch of each follows):
   1) Managed Tables (Internal Tables)
   2) External Tables
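
A minimal sketch of the difference between the two (paths and names are illustrative):

-- Managed (internal): Hive owns the data; DROP TABLE removes both the metadata and the files.
CREATE TABLE emp_managed (empid INT, ename STRING);

-- External: Hive tracks only the metadata; DROP TABLE leaves the files in place.
CREATE EXTERNAL TABLE emp_external (empid INT, ename STRING)
LOCATION '/data/emp_external';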


Part 2: Click Here
                                                                                                                  @SsaiK

Wednesday, March 16, 2016

What is SQOOP? - step by step


::SQOOP::  (SQL + HADOOP)

Sqoop is a Hadoop component built on top of HDFS that is meant for interacting with RDBMS systems.

SQOOP == SQL + HADOOP

RDBMS (Oracle or MySQL): a relational database holds structured data. When the volume of incoming data becomes huge, it is hard to store and process it in an RDBMS, so it is better to move it to HDFS.
Moving data from an RDBMS to HDFS needs a tool, and that tool is Sqoop.

Below are some key observations with respect to Sqoop:


  • Sqoop import/export works with HDFS only (the local file system, LFS, is not involved).
  • To interact with any RDBMS using Sqoop, the target RDBMS should be Java compatible.
  • Interacting with an RDBMS through Sqoop requires that RDBMS's specific connector, i.e. its driver jar must be present in Sqoop's installed lib directory.
What can we do using Sqoop:

  • import an entire table using Sqoop.
  • import part of a table with a "where" clause or a "columns" clause.
  • import all tables from a specified DB.
  • export a table from HDFS to an RDBMS.

::connect to RDBMS via sqoop:
sqoop import --connect jdbc:mysql://<ip address>/<DBname> --table <Table Name>;

Sometimes you'll get an "access denied" error because of missing privileges. Here I'm running MySQL locally, so the following statement can be used:

grant all privileges on  <DBname>.* to '%'@'localhost';

To grant to only one user, use the username instead of %.

By default Sqoop uses 4 mappers to perform a task. This can be changed with -m <num of mappers>.
The default field delimiter is ",".

Syntax: sqoop job [GENERIC-ARGS] [JOB-ARGS] [-- [<tool-name>] [TOOL-ARGS]]

Following are some common arguments used in Sqoop (a combined example follows the list).

Change the target directory with --target-dir '<path>'

Use a different field delimiter instead of "," with --fields-terminated-by '|'

Import specific columns with --columns 'empid,ename'

Filter rows with --where 'esal>2000'
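
Putting these arguments together, a sketch of a single import (the database, table and columns follow the examples on this page; the target path is just a placeholder):

sqoop import --connect jdbc:mysql://localhost/batch9 --table emp --columns 'empid,ename,esal' --where 'esal > 2000' --fields-terminated-by '|' --target-dir '/sqoop/emp_filtered' -m 1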

List the available Sqoop commands with: sqoop help

List of Databases: sqoop list-databases --connect jdbc:mysql://localhost;

List tables in a database: sqoop list-tables --connect jdbc:mysql://localhost/<DBname>

Import all tables from DB: sqoop import-all-tables --connect jdbc:mysql://localhost/batch9 -m 1

eval: evaluate a SQL statement and display the results.
With "eval" and --query, the results are only displayed from the RDBMS and nothing is written to HDFS, whereas "import" with --query writes the results into HDFS, not back to the RDBMS.

Example of eval and query:
sqoop eval --connect jdbc:mysql://localhost/batch9 --query "select * from emp";

Example of "import and --query"
sqoop import --connect jdbc:mysql://localhost/batch9 --query "select * from emp where \$CONDITIONS" --target-dir '/import9query';

sqoop import --connect jdbc:mysql://localhost/batch9 --query "select * from emp where sal > 1000 AND \$CONDITIONS" --target-dir '/import9query';


NOTE: \$CONDITIONS must be present in the WHERE clause of every free-form query used with --query.





Import binary data:
sqoop import --connect jdbc:mysql://localhost/batch9 --table emp -m 1 --where 'sal > 1000' --target-dir '/import9query' --as-sequencefile;

avro file: --as-avrodatafile;

Job management arguments (an example follows the list):
sqoop job [GENERIC-ARGS] [JOB-ARGS] [-- [<tool-name>] [TOOL-ARGS]]
--create <job-id>           Create/save a new job
--delete <job-id>           Delete a saved job
--exec <job-id>             Run a saved job
--help                      Print usage instructions
--list                      List saved jobs
--meta-connect <jdbc-uri>   Specify the JDBC connect string for the metastore
--show <job-id>             Show the parameters for a saved job
--verbose                   Print more information while working
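
A sketch of saving and running a job (the job name and target path are placeholders; the database and table come from the earlier examples):

sqoop job --create empjob -- import --connect jdbc:mysql://localhost/batch9 --table emp -m 1 --target-dir '/sqoop/empjob'
sqoop job --list
sqoop job --show empjob
sqoop job --exec empjob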

EXPORT: some key observations with respect to Sqoop
1) Before exporting, the target table (schema) must already exist in the RDBMS.
2) The order and types of the fields must match.
3) When exporting multiple files, if even a single record does not match, the entire export fails.

sqoop export --connect jdbc:mysql://localhost/<DB> --table emp --export-dir '<filename_path>' --fields-terminated-by '|';


ERROR: Sometimes an import fails asking for "--split-by"; this happens when the table has no primary key, because Sqoop cannot decide how to split the work across mappers.
Resolve it by setting the number of mappers to 1, adding a primary key to the table, or passing --split-by <column> to Sqoop.
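
For example, to import such a table with more than one mapper, a split column can be supplied explicitly (the column and target path are illustrative):

sqoop import --connect jdbc:mysql://localhost/batch9 --table emp --split-by empid --target-dir '/sqoop/emp_split'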

Comment below for any queries and share it. @SsaiK

Program to find Length of each word in Hadoop


The following program finds the length of each word in the given input file.

For example, if your input file contains the following info

"Hi SsaiK
this program help someone who new to Hadoop"

Output:

Hadoop 6
help 4
Hi 2
new 3
program 7
someone 7
SsaiK 5
this 4
to 2
who 3

package tut.ssaik.com.lenofword;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class lengthofword {
 public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
  public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
   String line = value.toString();
   String[] letters = line.split(" ");

   for (String letter : letters) {
    context.write(new Text(letter), new IntWritable(letter.length()));
   }
  }
 }

 public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
   // Every value for a given word is the same (its length), so emit the first one.
   context.write(key, values.iterator().next());
  }
 }

 public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
  if (otherArgs.length != 2) {
   System.err.println("Usage: lengthofword <in> <out>");
   System.exit(2);
  }
  Job job = new Job(conf, "word count by www.ssaik.com");
  job.setJarByClass(lengthofword.class);
  job.setMapperClass(TokenizerMapper.class);
  job.setCombinerClass(IntSumReducer.class);
  // filesystem fs = filesystem.get(conf);
  job.setReducerClass(IntSumReducer.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
  FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
  System.exit(job.waitForCompletion(true) ? 0 : 1);
  // if(fs.exists(outputDir))
  // fs.delete(outputDir,true);
 }
}
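
Assuming the class above is packaged into a jar (the jar name and HDFS paths below are only placeholders), the job can be run as:

hadoop jar lenofword.jar tut.ssaik.com.lenofword.lengthofword /input/words.txt /output/wordlen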



Click here to find Number of vowels and consonants in the given file.
Please comment below for any other queries.

Monday, March 7, 2016

What is PIG? - Hadoop


Apache Pig is meant for processing data, i.e., for data summarization, querying and advanced querying.

Pig has its own scripting language called Pig Latin. Whenever Pig actions are executed, the framework internally triggers MapReduce jobs.

Pig was introduced by Yahoo and is now a major Apache project. Pig is built on top of MapReduce.

Pig organizes data into bags: a bag is a collection of tuples (which may be ordered or unordered), and each tuple is made up of atomic fields (atoms).


NOTE: Pig has no metastore or warehouse of its own. Pig can work in 3 modes: LOCAL, HDFS and EMBEDDED mode.

LOCAL mode: the input data is taken from an LFS path (not from HDFS) and, once processing is complete, the generated output is also written to an LFS path; in local mode there is no intervention of HDFS. The framework runs the MapReduce jobs locally against the data on the local file system.
$> pig -x local
If there is a script, use $> pig -x local <<script-name>>

HDFS mode: Input is taken from HDFS, and output is into HDFS.
$> pig
if there is a script, use $> pig <<script-name>>

EMBEDDED mode: if the desired functionality cannot be achieved with the existing commands, operations or actions, we can choose embedded mode to develop customized functions,
i.e. UDFs (User Defined Functions).

Following are the default transformations/operators in Pig (a short script using a few of them follows the list):

1) LOAD            2) FOREACH         3) GENERATE
4) FILTER          5) DUMP            6) STORE
7) DESCRIBE        8) SPLIT           9) ORDER BY
10) GROUP BY       11) JOIN           12) UNION
13) CROSS          14) LIMIT          15) TOKENIZE
16) EXPLAIN        17) ILLUSTRATE     18) FLATTEN
19) AGG FUNCTIONS  20) DISTINCT       21) COGROUP
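
A short Pig Latin sketch using a few of these operators (the file path and field names are made up):

emp = LOAD '/data/emp.txt' USING PigStorage(',') AS (empid:int, ename:chararray, esal:float);
rich = FILTER emp BY esal > 2000.0f;
names = FOREACH rich GENERATE ename, esal;
sorted = ORDER names BY esal DESC;
DUMP sorted;

Each statement builds a new relation; DUMP (or STORE) triggers the underlying MapReduce jobs.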

Following datatypes are used in Pig:

DATA TYPE        PIG LATIN DATATYPE
int              int
String           chararray
float            float
long             long
boolean          boolean
byte             bytearray (default)

TIP: Pig can be executed in 2 flavors: the grunt shell (line by line) and script mode (a group of commands).

Sunday, March 6, 2016

Vowel and Consonant map reduce program in Hadoop



The following program counts how many words in the given input file start with a vowel and how many start with a consonant. For example,

my input file "vandc.txt" contains:

Hello there!
i love you.

It has 1 word starting with a vowel ("i") and 4 words starting with consonants ("Hello", "there!", "love", "you."), so the output would be:
Vowel 1
Consonant 4

How to execute the program, please click here.

package tut.ssaik.com.vandc;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class vowel {
 public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
   String line = value.toString();

   String[] letters = line.split(" ");

   for (String letter : letters) {
    // Skip empty tokens produced by consecutive spaces or blank lines.
    if (letter.isEmpty())
     continue;
    if (isVowel(letter))
     context.write(new Text("Vowel"), new IntWritable(1));
    else
     context.write(new Text("Consonant"), new IntWritable(1));
   }
  }

  public boolean isVowel(String line) {
   char wr = line.charAt(0);
   if (wr == 'a' || wr == 'e' || wr == 'i' || wr == 'o' || wr == 'u') {
    return true;
   } else {
    return false;
   }
  }
 }

 public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
    throws IOException, InterruptedException {
   int sum = 0;
   for (IntWritable val : values) {
    sum += val.get();
   }
   result.set(sum);
   context.write(key, result);
  }
 }

 public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
  if (otherArgs.length != 2) {
   System.err.println("Usage: vowel <in> <out>");
   System.exit(2);
  }
  Job job = new Job(conf, "word count");
  job.setJarByClass(vowel.class);
  job.setMapperClass(TokenizerMapper.class);
  job.setCombinerClass(IntSumReducer.class);
  // filesystem fs = filesystem.get(conf);
  job.setReducerClass(IntSumReducer.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
  FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
  System.exit(job.waitForCompletion(true) ? 0 : 1);
  // if(fs.exists(outputDir))
  // fs.delete(outputDir,true);

 }
}
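
Assuming the class above is packaged into a jar (the jar name and HDFS paths below are only placeholders), the job can be run as:

hadoop jar vandc.jar tut.ssaik.com.vandc.vowel /input/vandc.txt /output/vandc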


**Please comment below to encourage.**
post by SsaiK

Tuesday, March 1, 2016

What is combiner in hadoop?


What is a combiner?
A combiner is a mini reducer that performs a local reduce task. It typically does the same work as the reducer and can reuse the reducer's code (or a custom implementation). It receives the input from the mapper on a particular node and sends its output to the reducer.

Why use a combiner?
If the combiner does the same work as the reducer, do we really need it? Yes: combiners enhance the efficiency of MapReduce by reducing the amount of data that has to be sent to the reducers.

How does it make the data flow efficient?
The combiner does part of the reducer's work inside the mapper itself and sends the already-reduced data to the reducer; the reducer then combines the data from all the mappers and produces the final result as <key, value> pairs.
For example, the mapper produces its output as <key, value> pairs, and all of that data has to be sent to the reducer over the network.

Before the reducer there is another phase, called shuffle & sort, which shuffles the data and sorts it by key in ascending order. Suppose that after shuffle and sort the mapper output is 100 MB and takes 60 seconds to reach the reducer over the network.



But when a combiner comes into the picture, the amount of data may decrease, depending on how many repeated keys there are. If there are repeated keys, the combiner runs the reducer logic on each mapper's output, so the size of the mapper output shrinks, say to roughly 80 MB, and takes only about 40 seconds to reach the reducer over the network. Performance improves because less data has to be moved.
For example:
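
A minimal illustration with hypothetical word-count pairs: without a combiner, one mapper might emit (the,1), (the,1), (the,1), (cat,1) and send all four records over the network; with the reducer reused as a combiner, the same mapper emits (the,3), (cat,1), so only two records travel to the reducer and the final result is unchanged.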



The reducer combines the values for the repeated keys and writes the output to a file named "part-r-00000".

How to use a combiner in a program?
It's very simple: just add job.setCombinerClass(<reducer class>.class); in the driver code. You can use your reducer class here, or any custom combiner class of your choice.

job.setJarByClass(lengthofword.class);
job.setMapperClass(mapperProg.class);
job.setCombinerClass(reducerProg.class);
job.setReducerClass(reducerProg.class);
job.setOutputKeyClass(Text.class);

NOTE:
The combiner is only an optimization: the framework may run it zero, one or several times, so a job must produce correct results even when the combiner does not run.

If you like my post, please share it and give it a boost. Post by @SsaiK