Friday, June 17, 2016

PIG Transformations or Operators - Part 2

Loading the data into “base”
grunt> base = load 'filepig.txt' using PigStorage('|') as (empid:int,ename:chararray,salary:int);

PigStorage('|') → splits each line of the file into columns on the delimiter '|'.
filepig.txt contains 3 columns separated by '|'. PigStorage('|') splits them, and the as (empid:int,ename:chararray,salary:int) clause gives each column a name and a type: empid is the 1st column, ename the 2nd, and salary the 3rd.

The above line loads the data from filepig.txt into the relation base (the variable that holds the result).

We can check the schema of base by using describe base;
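As a small sketch of what describe reports, assuming filepig.txt is loaded exactly as above, the output should look roughly like the comment at the end (the exact formatting can vary between Pig versions):

```pig
-- Load with an explicit schema (same statement as in the post).
base = load 'filepig.txt' using PigStorage('|') as (empid:int, ename:chararray, salary:int);

-- Print the schema of the relation; this does not read the data.
describe base;
-- base: {empid: int,ename: chararray,salary: int}
```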

foreach and generate: goes through each row of a relation and generates the listed columns (or expressions) for it.
grunt> eid = foreach base generate empid;
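foreach ... generate can also project several columns and compute new ones. The sketch below assumes the same filepig.txt layout; the names yearly and annual_salary are illustrative and not from the original post.

```pig
base = load 'filepig.txt' using PigStorage('|') as (empid:int, ename:chararray, salary:int);

-- For every row, keep two columns and add a computed one.
-- annual_salary is a hypothetical column name chosen for this example.
yearly = foreach base generate empid, ename, salary * 12 as annual_salary;

-- Columns can also be addressed by position: $0 is empid, $2 is salary.
by_position = foreach base generate $0, $2;

dump yearly;
```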

DUMP: dump runs the statements that define a relation and prints its contents to the console.
grunt> dump base;

FILTER: filter keeps only the rows that match a condition, for example the employees who have a salary of more than 4000.
grunt> moresal = filter base by salary > 4000;

grunt> dump moresal;
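"More than 4000" corresponds to the > comparison, while == keeps only rows whose salary is exactly 4000. Conditions can also be combined. A minimal sketch, assuming the same base relation; midsal and nosal are illustrative names:

```pig
base = load 'filepig.txt' using PigStorage('|') as (empid:int, ename:chararray, salary:int);

-- Strictly greater than 4000.
moresal = filter base by salary > 4000;

-- Conditions can be combined with and / or / not.
midsal = filter base by salary >= 2000 and salary <= 4000;

-- A value that cannot be parsed as int is loaded as null; find such rows.
nosal = filter base by salary is null;

dump moresal;
```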

ILLUSTRATE: runs the statements on a small sample of the data and shows, step by step, the schema and rows that each relation produces.
grunt> illustrate base;

EXPLAIN: explain shows the logical plan, the physical plan, and the MapReduce (map and reduce) plan used to compute a relation, along with the scope of each operator in them.
grunt> explain base;

Order By:
grunt> ordersalary = order base by salary;
grunt> dump ordersalary;
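order sorts in ascending order by default. As a sketch, assuming the same base relation, a descending sort with a secondary key can be written like this (ordersalary_desc is an illustrative name):

```pig
base = load 'filepig.txt' using PigStorage('|') as (empid:int, ename:chararray, salary:int);

-- Highest salary first; rows with equal salary are sorted by name.
ordersalary_desc = order base by salary desc, ename asc;

dump ordersalary_desc;
```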

Group BY:
grunt> group_data = group base by empid;
grunt> dump group_data;

grunt> describe group_data;
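The result of group has two fields: group, which holds the key, and a bag named after the input relation that holds all rows for that key. The sketch below assumes the base relation from the start of the post; grouped and per_key are illustrative names, and the describe output is approximate because its formatting differs between Pig versions.

```pig
base = load 'filepig.txt' using PigStorage('|') as (empid:int, ename:chararray, salary:int);

grouped = group base by empid;

describe grouped;
-- Roughly: grouped: {group: int,base: {(empid: int,ename: chararray,salary: int)}}

-- A typical next step is one output row per key, for example a row count.
per_key = foreach grouped generate group as empid, COUNT(base) as n;

dump per_key;
```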


Let's take another table, dept.txt, and rename the old file to emp.txt to avoid confusion.
Copy the dept and emp files to HDFS.
$ hadoop fs -put dept.txt
$ hadoop fs -put emp.txt

grunt> emp = load 'emp.txt' using PigStorage('|');
grunt> dept = load 'dept.txt' using PigStorage('|');

grunt> joindata = join emp by $0, dept by $1;
grunt> dump joindata;
Here $0 refers to the first column of the emp relation and $1 to the second column of the dept relation.
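When the relations are loaded with a schema, the join keys can be given by name instead of position. The sketch below assumes the emp and dept column layouts that this post uses for the outer joins; result, ename and dname are illustrative names.

```pig
emp  = load 'emp.txt'  using PigStorage('|') as (eid:int, name:chararray, sal:bytearray);
dept = load 'dept.txt' using PigStorage('|') as (did:int, eid:int, name:chararray);

-- Same join as with $0 and $1, but with named keys.
joindata = join emp by eid, dept by eid;

-- After a join, a column that exists on both sides is qualified as relation::column.
result = foreach joindata generate emp::eid as eid, emp::name as ename, dept::name as dname;

dump result;
```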

Left Join:
Left, right, and full outer joins will not run unless the relations have a well-defined schema. For example, a left join on emp and dept as loaded above (without a schema) will not work.

The following left join works because both relations are loaded with a schema.
grunt> emp = load 'emp.txt' using PigStorage('|') as (eid:int,name:chararray,sal:bytearray);
grunt> dept = load 'dept.txt' using PigStorage('|') as (did:int,eid:int,name:chararray);
grunt> ljoin = join emp by eid left,dept by eid;
grunt> dump ljoin;

Right Join:
grunt> rjoin = join emp by eid right,dept by eid;
grunt> dump rjoin;
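A full outer join keeps unmatched rows from both sides, and an outer join followed by a null check finds the rows that had no match. This is a sketch under the same emp and dept schemas; fjoin and no_dept are illustrative names.

```pig
emp  = load 'emp.txt'  using PigStorage('|') as (eid:int, name:chararray, sal:bytearray);
dept = load 'dept.txt' using PigStorage('|') as (did:int, eid:int, name:chararray);

-- Keep every row from both sides; the missing side is filled with nulls.
fjoin = join emp by eid full outer, dept by eid;

-- Left join, then keep only the employees that have no matching dept row.
ljoin   = join emp by eid left outer, dept by eid;
no_dept = filter ljoin by dept::eid is null;

dump no_dept;
```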


STORE: store writes a relation to a directory in HDFS.
grunt> store rjoin into '/user/hadoop/rjoin_putput';

[hadoop@localhost blog]$ hadoop fs -cat /user/hadoop/rjoin_putput/part-r-00000
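By default store writes tab-separated fields. To keep the '|' delimiter of the input files, PigStorage can be named in the store statement as well. A minimal sketch; the path /user/hadoop/rjoin_pipe is a hypothetical output directory and must not already exist, otherwise the job fails.

```pig
emp  = load 'emp.txt'  using PigStorage('|') as (eid:int, name:chararray, sal:bytearray);
dept = load 'dept.txt' using PigStorage('|') as (did:int, eid:int, name:chararray);

rjoin = join emp by eid right, dept by eid;

-- Write the result with '|' between fields instead of the default tab.
-- '/user/hadoop/rjoin_pipe' is a made-up path for this example.
store rjoin into '/user/hadoop/rjoin_pipe' using PigStorage('|');
```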

UNION: union is used to combine two similar bags.
grunt> Union_data = union emp,dept;
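union is easiest to reason about when both inputs have the same schema. The sketch below splits the base relation from the start of the post into two parts and combines them again; low, high and all_emp are illustrative names. Note that union neither preserves row order nor removes duplicates.

```pig
base = load 'filepig.txt' using PigStorage('|') as (empid:int, ename:chararray, salary:int);

-- Two relations with identical schemas.
low  = filter base by salary <= 4000;
high = filter base by salary > 4000;

-- Combine them into one relation; order is not guaranteed.
all_emp = union low, high;

dump all_emp;
```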

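CROSS: cross computes the cross product of two relations, pairing every row of the first with every row of the second.
grunt> cross_data = cross emp,dept;
Every emp row is paired with every dept row, so the result has (rows in emp) × (rows in dept) rows.
grunt> dump cross_data;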

LIMIT: limit restricts the output to a given number of rows.
grunt> limit_data = limit cross_data 3;
grunt> dump limit_data;
This limits the output to 3 rows.
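On its own, limit returns an arbitrary set of rows. Combined with order it gives a top-N result. A sketch, assuming the base relation from the start of the post; sorted and top3 are illustrative names.

```pig
base = load 'filepig.txt' using PigStorage('|') as (empid:int, ename:chararray, salary:int);

-- Sort by salary, highest first, then keep the first 3 rows.
sorted = order base by salary desc;
top3   = limit sorted 3;

dump top3;
```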

Flatten, AggFunctions, Distinct, Cogroup, Tokenizer

