Monday, November 28, 2016

Aggregation functions in PIG



Example 1:
year price
2015 60
2014 45
2016 34
2014 75
2015 45
2014 41
I would like to get the total price of the item for each year.

grunt> file = load '/tmp/yearprice.txt' using PigStorage(',') as (year:int,price:float);

grunt> data = filter file by year > 1;  -- the header row has a null year, so it is dropped here

grunt> grp = group data by year;  -- "group" is a reserved word and cannot be used as an alias

grunt> sum = foreach grp generate group as year, SUM(data.price);

grunt> dump sum;

2014 161
2015 105
2016 34
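
SUM is only one of Pig's built-in aggregation functions. As a minimal sketch, assuming the same '/tmp/yearprice.txt' file, the other common ones (COUNT, MIN, MAX, AVG) can be applied to the same grouped relation. The aliases stats, n, lo, hi and mean are names chosen here for illustration only.

```pig
grunt> file = load '/tmp/yearprice.txt' using PigStorage(',') as (year:int, price:float);
grunt> data = filter file by year > 1;  -- drop the header row (null year)
grunt> grp = group data by year;
-- One output row per year: row count, lowest, highest and average price.
grunt> stats = foreach grp generate group as year, COUNT(data) as n, MIN(data.price) as lo, MAX(data.price) as hi, AVG(data.price) as mean;
grunt> dump stats;
```

Each function takes the bag produced by the group statement, which is why the price column is addressed as data.price rather than just price.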


Example 2:

id1,1,on,400
id1,2,off,100
id2,3,on,200
I would like to get the sum of $3 where $2 is 'on', grouped by ID ($0).

grunt> file = load '/tmp/file' using PigStorage(',');

grunt> refineData = foreach file generate $0, $1, (((chararray)$2 == 'on') ? (int)$3 : 0);  -- explicit casts, because fields loaded without a schema are bytearray

grunt> grp = group refineData by $0;

grunt> sum = foreach grp generate group as id, SUM(refineData.$2);

grunt> dump sum;

id1, 400
id2, 200
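
The same result can also be reached by filtering before grouping. This is a sketch under the assumption that the file can be loaded with a schema. The field names id, seq, status and amount are hypothetical, picked here only to make the script readable.

```pig
grunt> file = load '/tmp/file' using PigStorage(',') as (id:chararray, seq:int, status:chararray, amount:int);
-- Keep only the rows whose status is 'on' before grouping.
grunt> onRows = filter file by status == 'on';
grunt> grp = group onRows by id;
grunt> sum = foreach grp generate group as id, SUM(onRows.amount);
grunt> dump sum;
```

One difference in behavior: an ID that has no 'on' rows disappears from this output, whereas the conditional version keeps it with a sum of 0.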

Wednesday, November 16, 2016

PIG left, right outer joins error



Pig can perform inner joins as well as left, right, and full outer joins. Sometimes a left or right outer join fails with an error even though the equivalent inner join works. The cause is the schema.

An inner join runs even when the relations have no schema, but outer joins do not.

Left, right, and full outer joins will not execute unless the relation that has to supply nulls for non-matching keys has a well-defined schema.

Left Join: a left join on relations loaded without a schema will not work, but the following left join will work, because both relations are loaded with a schema.

grunt> emp = load 'emp.txt' using PigStorage('|') as (eid:int, name:chararray, sal:bytearray);
grunt> dept = load 'dept.txt' using PigStorage('|') as (did:int, eid:int, name:chararray);
grunt> ljoin = join emp by eid left, dept by eid;
grunt> dump ljoin;
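
For contrast, here is a sketch of the schema-less case described above. It assumes the same 'emp.txt' and 'dept.txt' files. The aliases emp_raw and dept_raw are hypothetical names used so the relations loaded with a schema are not overwritten.

```pig
-- No AS clause, so neither relation has a schema; fields are addressed by position.
grunt> emp_raw = load 'emp.txt' using PigStorage('|');
grunt> dept_raw = load 'dept.txt' using PigStorage('|');
-- Inner join on positional fields: this is accepted without a schema.
grunt> ijoin = join emp_raw by $0, dept_raw by $1;
-- Left outer join on the same fields: expected to be rejected, because Pig
-- cannot tell how many null fields to emit for dept_raw when there is no match.
grunt> ljoin_raw = join emp_raw by $0 left, dept_raw by $1;
```

Adding an as (...) clause to the load statements, as in the working example, is the fix.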



Right Join:
grunt> rjoin = join emp by eid right, dept by eid;
grunt> dump rjoin;
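
A full outer join follows the same pattern and has the same schema requirement on both sides. The sketch below assumes the same two files and schemas. It also shows one way to pick out the unmatched rows afterwards. The aliases fjoin, noDept and noEmp are illustrative names.

```pig
grunt> emp = load 'emp.txt' using PigStorage('|') as (eid:int, name:chararray, sal:bytearray);
grunt> dept = load 'dept.txt' using PigStorage('|') as (did:int, eid:int, name:chararray);
grunt> fjoin = join emp by eid full outer, dept by eid;
-- After a join, fields are prefixed with their relation name (emp::eid, dept::eid).
-- Employees with no matching department row:
grunt> noDept = filter fjoin by dept::eid is null;
-- Department rows with no matching employee:
grunt> noEmp = filter fjoin by emp::eid is null;
grunt> dump noDept;
```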