Monday, November 28, 2016

Aggregation functions in PIG



Example 1:
year price
2015 60
2014 45
2016 34
2014 75
2015 45
2014 41
The goal is to get the total price of the items for each year.

grunt> file = load '/tmp/yearprice.txt' using PigStorage(',') as (year:int,price:float);

grunt> data = filter file by $0 > 1;

grunt> grp = group data by year;

grunt> sum = foreach grp generate group as year, SUM(data.price);

grunt> dump sum;

2014 161
2015 105
2016 34
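The GROUP followed by SUM above can be modelled outside Pig as a plain group-and-accumulate loop. The following is a minimal Python sketch of that logic, not of how Pig executes it; the function name sum_by_key is hypothetical and the rows are the sample data from this example.

```python
from collections import defaultdict

# Sample rows copied from the example above: (year, price).
rows = [(2015, 60.0), (2014, 45.0), (2016, 34.0),
        (2014, 75.0), (2015, 45.0), (2014, 41.0)]

def sum_by_key(pairs):
    """Group (key, value) pairs by key and add up the values per key."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    # Sort by key only to make the printed result easy to read.
    return dict(sorted(totals.items()))

totals = sum_by_key(rows)
for year, total in totals.items():
    print(year, total)
```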


Example 2:

id1, 1,on,400
id1, 2,off,100
id2, 3,on,200
The goal is to get the sum of $3 where $2 is 'on', grouped by the ID in $0.

grunt> file = load '/tmp/file' using PigStorage(',');

grunt> refineData = foreach file generate $0, $1, (($2 == 'on') ? $3 : 0);

grunt> grp = group refineData by $0;

grunt> sum = foreach grp generate group as id, SUM(refineData.$2);

grunt> dump sum;

id1, 400
id2, 200
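The conditional expression in refineData replaces the amount with 0 whenever the state is not 'on', so rows that are 'off' still belong to their group but add nothing to the sum. A small Python sketch of that idea, using the three sample rows from this example (the variable names are made up for illustration):

```python
# Sample rows from the example above: (id, sequence, state, amount).
records = [("id1", 1, "on", 400),
           ("id1", 2, "off", 100),
           ("id2", 3, "on", 200)]

totals = {}
for rec_id, _seq, state, amount in records:
    # Same idea as the Pig bincond (($2 == 'on') ? $3 : 0).
    counted = amount if state == "on" else 0
    totals[rec_id] = totals.get(rec_id, 0) + counted

print(totals)
```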

Wednesday, November 16, 2016

PIG left, right outer joins error



Pig supports inner joins as well as left and right outer joins. Outer joins sometimes fail with an error where the equivalent inner join works; the cause is a missing schema.

An inner join runs even when the relations have no schema, but a left or right outer join does not.

Left, right, and full outer joins are not executed unless the relations have a well-defined schema.

Left Join: a left join on relations that were loaded without a schema will not work.


The following left join works because both relations declare a schema.

grunt> emp = load 'emp.txt' using PigStorage('|') as  (eid:int,name:chararray,sal:bytearray);
grunt> dept = load 'dept.txt' using PigStorage('|') as (did:int,eid:int,name:chararray);
grunt> ljoin = join emp by eid left,dept by eid;
grunt> dump ljoin;



Right Join:
grunt> rjoin = join emp by eid right,dept by eid;
grunt> dump rjoin;
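One way to see why the schema matters: an outer join has to pad unmatched rows with nulls, and to do that it must know how many fields the other side has. The sketch below models a left outer join in Python under that assumption. The sample rows and the function name left_outer_join are hypothetical and are not taken from emp.txt or dept.txt.

```python
# Hypothetical sample rows, shaped like the schemas declared above.
emp = [(1, "asha", "1000"), (2, "ravi", "2000"), (3, "mei", "3000")]   # (eid, name, sal)
dept = [(10, 1, "sales"), (20, 4, "ops")]                             # (did, eid, name)

def left_outer_join(left, right, left_key, right_key, right_width):
    """Keep every left row; pad with None when no right row matches."""
    joined = []
    for lrow in left:
        matches = [r for r in right if r[right_key] == lrow[left_key]]
        if matches:
            joined.extend(lrow + r for r in matches)
        else:
            # right_width plays the role of the schema: it says how many
            # null fields an unmatched row needs.
            joined.append(lrow + (None,) * right_width)
    return joined

ljoin = left_outer_join(emp, dept, left_key=0, right_key=1, right_width=3)
for row in ljoin:
    print(row)
```

A right outer join is the mirror image: every dept row is kept and the emp fields are padded instead.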


Monday, June 20, 2016

What is HTTP Tunneling - Connections Through Restrictions


An HTTP tunnel creates a network link between two locations by carrying the traffic inside HTTP. It is typically used to get through firewalls and other network restrictions.


But before proceeding further, let us look into the meaning of some of the terms we will use in this context.

Commonly Used Terms

HTTP

HTTP, or Hypertext Transfer Protocol, is the network protocol used by web browsers to communicate with web servers. HTTP defines how messages should be formatted and transmitted, and what actions web servers and browsers should take in response to various commands. For example, when you enter a URL as the web address, an HTTP command is sent to the web server directing it to fetch and transmit the requested web page. HTTP is called a stateless protocol because each command is executed independently, without any knowledge of the preceding commands.

Tunneling

Tunneling, sometimes loosely called “port forwarding,” is the method of transmitting private network data and protocol information through a public network by encapsulating the data. Instead of sending a packet directly through the network, the data is sent inside another connection, which is often encrypted. Tunneling can be achieved with nearly all protocols. A tunnel is used to ship a foreign protocol across a network that normally wouldn’t support it: you can take protocol A and wrap it, or put it in a tunnel, with protocol B.
Tunneling provides the basic underlying structure for setting up a VPN, or Virtual Private Network. A VPN is a network constructed by using public connections, usually the Internet, to connect to a private network, such as a company’s internal network. The process uses encryption and transmission protocols to create secure virtual tunnels for data transmission. This is meant to ensure that only authorized users can access the network and that the data cannot be read if it is intercepted. Data is transmitted over the Internet in the form of packets. The information carried in a data packet is called the payload, while the packet header contains the routing information required to deliver the packet to a remote destination.
In a VPN connection, a tunnel provides a secure medium for data exchanged between the corporate intranet, remote users, and the networks of branch offices, suppliers, and business partners. The creation of a tunnel requires the following:
  • Carrier protocol: This is the network transport protocol to be used as the carrier protocol, for example PPP.
  • Encapsulation protocol: This protocol encapsulates the payload of a data packet.
  • Passenger protocol: Refers to the protocol used by the data packets that are being transmitted through the tunnel. NetBEUI is an example of a passenger protocol.

What is HTTP Tunneling?

HTTP tunneling is the process in which communications are encapsulated by using the HTTP protocol. An HTTP tunnel is often used for network locations which have restricted connectivity or are behind firewalls or proxy servers. A firewall is typically a computer and software that sits between a group of client users and the wider outside Internet or intranet. The firewall is used to protect the internal client network from unauthorized access from outside the firewall.
Certain networks may have restricted connectivity in the form of blocked TCP/IP ports. Traffic is restricted to secure the network from internal and external threats, and most network protocols are blocked except for a few that are used for secured communication. If a user in a corporate environment, for example, has no permission to open TCP connections, the user cannot use certain services or connect to the Internet. In such a case, HTTP tunneling is a possible solution: the protocol is encapsulated inside HTTP requests that can pass through the firewall or HTTP proxy. In HTTP tunneling, the HTTP protocol acts as a wrapper for a channel that the tunneled network protocol uses to communicate. HTTP tunnel software is used for this purpose. It consists of client-server HTTP tunneling applications that integrate with existing application software and allow it to communicate despite restricted network connectivity. The application plays the role of a tunneling client. HTTP tunnel clients are used to access applications from behind restrictive firewalls or proxy servers, to access blocked sites, or to share confidential resources over HTTP securely.

How to Implement HTTP Tunneling?

When an HTTP connection is made through a proxy server, the client, which is usually the browser, sends the request to the proxy. The proxy opens the connection to the destination, sends the request, receives the response and sends it back to the client. The HTTP protocol specifies a request method called CONNECT. The CONNECT method can be used by the client to inform the proxy server that a connection to some host on some port is required. The proxy server tries to connect to the destination address specified in the request. If the operation fails, it sends a negative HTTP response back to the client and closes the connection. If the operation succeeds, it sends back a positive HTTP response and the connection is established. After that, the proxy server forwards all data in both directions between the client that requested the connection and the destination. It acts as the tunnel for this communication.
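The CONNECT exchange described above can be sketched without touching the network: one helper builds the request a client would send to the proxy, and another decides from the proxy's status line whether the tunnel was established. This is an illustrative Python sketch only; the helper names and the host example.com are assumptions, and a real client would also need sockets, timeouts, and possibly proxy authentication.

```python
def build_connect_request(host: str, port: int) -> bytes:
    """Build the request that asks a proxy to open a tunnel to host:port."""
    target = f"{host}:{port}"
    return (f"CONNECT {target} HTTP/1.1\r\n"
            f"Host: {target}\r\n"
            "\r\n").encode("ascii")

def tunnel_established(response_head: bytes) -> bool:
    """Return True when the proxy's status line carries a 2xx status code."""
    status_line = response_head.split(b"\r\n", 1)[0].decode("ascii", "replace")
    parts = status_line.split(" ", 2)
    return len(parts) >= 2 and parts[1].isdigit() and 200 <= int(parts[1]) < 300

request = build_connect_request("example.com", 443)
print(request.decode("ascii"))
print(tunnel_established(b"HTTP/1.1 200 Connection established\r\n\r\n"))
print(tunnel_established(b"HTTP/1.1 403 Forbidden\r\n\r\n"))
```

After a positive response, the same TCP connection carries the tunneled bytes unchanged in both directions.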
In some networks, the use of the CONNECT method is restricted to some trusted sites. In such cases, an HTTP tunnel can still be implemented using only the usual HTTP methods such as POST, GET, PUT and DELETE. In this case, the server runs outside the protected network and acts as a special HTTP server. The client program runs on a computer inside the protected network. Whenever any network traffic is passed to the client, it repackages the traffic as an HTTP request and relays it to the outside server, which extracts and executes the original network request on the client's behalf. The response to that request is repackaged as an HTTP response and relayed back to the client. Since all traffic is encapsulated inside normal GET and POST requests and responses, this approach works through most proxies and firewalls.
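The repackaging step can be illustrated with a pair of helpers: one wraps arbitrary bytes in the body of an ordinary POST request, and the other recovers them on the relay side. This is a minimal sketch under stated assumptions. The path /tunnel, the host relay.example, and the choice of base64 for the body are all invented for the example; real tunneling software defines its own framing.

```python
import base64

def wrap_in_post(payload: bytes, path: str = "/tunnel",
                 host: str = "relay.example") -> bytes:
    """Encapsulate raw bytes in an HTTP POST request (hypothetical framing)."""
    body = base64.b64encode(payload)
    head = (f"POST {path} HTTP/1.1\r\n"
            f"Host: {host}\r\n"
            "Content-Type: text/plain\r\n"
            f"Content-Length: {len(body)}\r\n"
            "\r\n").encode("ascii")
    return head + body

def unwrap_post(request: bytes) -> bytes:
    """Extract the original bytes from a request built by wrap_in_post."""
    _head, _sep, body = request.partition(b"\r\n\r\n")
    return base64.b64decode(body)

original = b"\x00\x01 any protocol bytes \xff"
wrapped = wrap_in_post(original)
print(wrapped.split(b"\r\n", 1)[0].decode("ascii"))
print(unwrap_post(wrapped) == original)
```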

Conclusion

The next time you want to use your Internet applications safely despite restrictive firewalls, and want an extra layer of protection against hackers, spyware, and ID theft, an HTTP tunnel may be the right option for you.
Source: Udemy Blog

Friday, June 17, 2016

Advanced scripts - PIG Part 3


FILTER:
filtered = FILTER climate BY country MATCHES 'C.*a'; -- matches China
filtered = FILTER climate BY country MATCHES '.*(ndi|hin).*'; -- matches India, China
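MATCHES succeeds only when the regular expression matches the whole string, which is why the second pattern needs a leading and trailing .* to find a fragment in the middle of a name. Python's re.fullmatch behaves the same way for these simple patterns, so it can be used to try them out. The country list below is hypothetical sample data, and pig_matches is a made-up helper name.

```python
import re

# Hypothetical sample values for the country column.
countries = ["China", "India", "Canada", "Chad"]

def pig_matches(value: str, pattern: str) -> bool:
    """Whole-string regex match, like Pig's MATCHES operator."""
    return re.fullmatch(pattern, value) is not None

starts_c_ends_a = [c for c in countries if pig_matches(c, "C.*a")]
# "India" contains "ndi" (not "nda"), so the alternation uses ndi.
has_fragment = [c for c in countries if pig_matches(c, ".*(ndi|hin).*")]

print(starts_c_ends_a)
print(has_fragment)
```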

SPLIT:
SPLIT climate into B1992 if $1 == 1992, B2002 if $1 == 2002;
This splits the climate relation into two relations, B1992 and B2002, which contain the 1992 and 2002 data respectively.
DUMP B1992;
DUMP B2002;
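SPLIT evaluates every condition for every row: a row lands in each output whose condition is true, and a row that satisfies none of the conditions is dropped. A small Python sketch of that behaviour, with hypothetical climate rows and a hypothetical split_rows helper:

```python
# Hypothetical climate rows: (country, year, temp).
climate = [("india", 1992, 31.5), ("china", 2002, 18.0),
           ("india", 2002, 32.1), ("china", 1995, 17.2)]

def split_rows(rows, **conditions):
    """Return one list per named condition; rows matching nothing are dropped."""
    return {name: [row for row in rows if cond(row)]
            for name, cond in conditions.items()}

parts = split_rows(climate,
                   B1992=lambda row: row[1] == 1992,
                   B2002=lambda row: row[1] == 2002)

print(parts["B1992"])
print(parts["B2002"])
```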

SAMPLE:
It returns a random sample of the rows in the relation.
sampled = SAMPLE climate 0.1; -- returns roughly 10% of the rows in climate, chosen at random
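The fraction passed to SAMPLE is a probability applied to each row, so the number of rows returned varies from run to run and is only approximately that fraction of the input. The Python sketch below shows the same idea; the function name sample_rows and the seed parameter are assumptions added so the example is repeatable.

```python
import random

def sample_rows(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

rows = list(range(1000))
picked = sample_rows(rows, 0.1, seed=42)

# The size is close to 10% of the input but is not guaranteed to be exactly 100.
print(len(picked))
```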

ORDER:
ordered = ORDER climate BY year ASC;

GROUP BY:
grouped = GROUP climate BY year;
grouped = GROUP climate BY (year, temp);

COGROUP:
It groups the tuples from two or more relations that share the same value of the grouping column into one output tuple per key.
cogrouped = COGROUP a BY a1, b BY b1;
DUMP cogrouped;

Relation a (a1, a2, a3)    Relation b (b1, b2)
(1, 2, 3)                  (4, 5)
(3, 5, 3)                  (1, 3)
(2, 2, 7)                  (5, 5)
(1, 4, 3)                  (2, 5)
(2, 2, 3)                  (1, 8)

All key values from a1 and b1: (1,3,2,1,2,4,1,5,2,1) --> distinct keys: (1,3,2,4,5)

(1,{(1,2,3),(1,4,3)},{(1,3),(1,8)})
(3,{(3,5,3)},{})
(2,{(2,2,7),(2,2,3)},{(2,5)})
(4,{},{(4,5)})
(5,{},{(5,5)})

cogroup_a_inner = COGROUP a BY a1 inner, b BY b1; --> at least one tuple must be present in a.
(1,{(1,2,3),(1,4,3)},{(1,3),(1,8)})
(3,{(3,5,3)},{})
(2,{(2,2,7),(2,2,3)},{(2,5)})

cogroup_b_inner = COGROUP a BY a1, b BY b1 inner; --> at least one tuple must be present in b.
(1,{(1,2,3),(1,4,3)},{(1,3),(1,8)})
(2,{(2,2,7),(2,2,3)},{(2,5)})
(4,{},{(4,5)})
(5,{},{(5,5)})

cogroup_a_b_inner = COGROUP a BY a1 inner, b BY b1 inner;
(1,{(1,2,3),(1,4,3)},{(1,3),(1,8)})
(2,{(2,2,7),(2,2,3)},{(2,5)})
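The four COGROUP variants above differ only in whether a key with an empty bag on one side is kept. The Python sketch below reproduces them with the sample relations a and b from this section; the function name cogroup and its two flags are hypothetical. Keys are emitted in first-seen order to line up with the listing above, whereas Pig itself does not promise any particular output order.

```python
# Sample relations from this section.
a = [(1, 2, 3), (3, 5, 3), (2, 2, 7), (1, 4, 3), (2, 2, 3)]   # (a1, a2, a3)
b = [(4, 5), (1, 3), (5, 5), (2, 5), (1, 8)]                  # (b1, b2)

def cogroup(left, right, left_inner=False, right_inner=False):
    """Group both relations on their first field, one output tuple per key."""
    keys = []
    for row in left + right:            # collect distinct keys, first-seen order
        if row[0] not in keys:
            keys.append(row[0])
    result = []
    for key in keys:
        left_bag = [row for row in left if row[0] == key]
        right_bag = [row for row in right if row[0] == key]
        # "inner" on a side drops keys whose bag on that side is empty.
        if (left_inner and not left_bag) or (right_inner and not right_bag):
            continue
        result.append((key, left_bag, right_bag))
    return result

for row in cogroup(a, b):
    print(row)
print([key for key, _, _ in cogroup(a, b, left_inner=True)])
print([key for key, _, _ in cogroup(a, b, right_inner=True)])
print([key for key, _, _ in cogroup(a, b, left_inner=True, right_inner=True)])
```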

TUPLE: 
Load the following data as tuples.
(1,2,3) (4,5,6)
(2,3,4) (5,6,7)

tuples = LOAD 'abc.txt' USING PigStorage(' ') AS (t1:tuple(t1a:int,t1b:int,t1c:int),t2:tuple(t2a:int,t2b:int,t2c:int));


BAG:
Load the following data as bags.
{(1,2,3),(1,4,3)} {(1,3),(1,8)}
{(3,5,3)} {(3,5)}
{(2,2,7),(2,2,3)} {(2,5)}

bag1 = load 'data.txt' USING PigStorage(' ') AS (B1:Bag{t1:tuple(t1a:int,t1b:int,t1c:int)}, B2:Bag{t2:tuple(t2a:int,t2b:int)});



PIG Transformations or Operators - Part 2




LOAD:
Loading the data into “base”
grunt> base = load 'filepig.txt' using PigStorage('|') as (empid:int,ename:chararray,salary:int);

PigStorage('|') -> splits each line into columns using the delimiter '|'.
filepig.txt contains 3 columns separated by the delimiter '|'. PigStorage('|') splits them, and the clause as (empid:int,ename:chararray,salary:int) names and types each column, so empid is the 1st column, ename the 2nd column, and salary the 3rd column.

The line above loads the data from filepig.txt into the relation base.






We can check the schema of a relation by using describe base;



foreach and generate: iterates over each row and generates the selected columns.
grunt> eid = foreach base generate empid;





DUMP: dump prints the contents of a relation to the console.
grunt> dump base;
Result:



FILTER: filter selects the rows that satisfy a condition, for example the employees who have a salary of more than 4000.
grunt> moresal = filter base by salary > 4000;

grunt> dump moresal;




ILLUSTRATE: shows, on a small sample of the data, how each statement transforms it, together with the schema at each step.
grunt> illustrate base;



EXPLAIN:
Shows the logical plan, the physical plan, and the MapReduce plan that Pig builds for a relation.
grunt> explain emp;
2016-06-17 02:35:55,746 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2016-06-17 02:35:55,752 [main] WARN  org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2016-06-17 02:35:55,752 [main] INFO  org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]}
#-----------------------------------------------
# New Logical Plan:
#-----------------------------------------------
emp: (Name: LOStore Schema: eid#10443:int,name#10444:chararray,sal#10445:chararray)
|
|---emp: (Name: LOForEach Schema: eid#10443:int,name#10444:chararray,sal#10445:chararray)
    |   |
    |   (Name: LOGenerate[false,false,false] Schema: eid#10443:int,name#10444:chararray,sal#10445:chararray)ColumnPrune:OutputUids=[10443, 10444, 10445]ColumnPrune:InputUids=[10443, 10444, 10445]
    |   |   |
    |   |   (Name: Cast Type: int Uid: 10443)
    |   |   |
    |   |   |---eid:(Name: Project Type: bytearray Uid: 10443 Input: 0 Column: (*))
    |   |   |
    |   |   (Name: Cast Type: chararray Uid: 10444)
    |   |   |
    |   |   |---name:(Name: Project Type: bytearray Uid: 10444 Input: 1 Column: (*))
    |   |   |
    |   |   (Name: Cast Type: chararray Uid: 10445)
    |   |   |
    |   |   |---sal:(Name: Project Type: bytearray Uid: 10445 Input: 2 Column: (*))
    |   |
    |   |---(Name: LOInnerLoad[0] Schema: eid#10443:bytearray)
    |   |
    |   |---(Name: LOInnerLoad[1] Schema: name#10444:bytearray)
    |   |
    |   |---(Name: LOInnerLoad[2] Schema: sal#10445:bytearray)
    |
    |---emp: (Name: LOLoad Schema: eid#10443:bytearray,name#10444:bytearray,sal#10445:bytearray)RequiredFields:null
#-----------------------------------------------
# Physical Plan:
#-----------------------------------------------
emp: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-621
|
|---emp: New For Each(false,false,false)[bag] - scope-620
    |   |
    |   Cast[int] - scope-612
    |   |
    |   |---Project[bytearray][0] - scope-611
    |   |
    |   Cast[chararray] - scope-615
    |   |
    |   |---Project[bytearray][1] - scope-614
    |   |
    |   Cast[chararray] - scope-618
    |   |
    |   |---Project[bytearray][2] - scope-617
    |
    |---emp: Load(hdfs://localhost:9000/user/hadoop/emp.txt:PigStorage('|')) - scope-610

2016-06-17 02:35:55,757 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2016-06-17 02:35:55,759 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2016-06-17 02:35:55,759 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
#--------------------------------------------------
# Map Reduce Plan
#--------------------------------------------------
MapReduce node scope-622
Map Plan
emp: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-621
|
|---emp: New For Each(false,false,false)[bag] - scope-620
    |   |
    |   Cast[int] - scope-612
    |   |
    |   |---Project[bytearray][0] - scope-611
    |   |
    |   Cast[chararray] - scope-615
    |   |
    |   |---Project[bytearray][1] - scope-614
    |   |
    |   Cast[chararray] - scope-618
    |   |
    |   |---Project[bytearray][2] - scope-617
    |
    |---emp: Load(hdfs://localhost:9000/user/hadoop/emp.txt:PigStorage('|')) - scope-610--------
Global sort: false
----------------

Order By:
grunt> ordersalary = order base by salary;
grunt> dump ordersalary;




Group BY:
grunt> group_data = group emp by eid;
grunt> dump group_data;


grunt> describe group_data;
 



JOIN:

Let's take another file, dept.txt, and rename the old file to emp.txt to avoid confusion.
Copy the dept and emp files to HDFS.
$ hadoop fs -put dept.txt
$ hadoop fs -put emp.txt


grunt> emp = load 'emp.txt' using PigStorage('|');
grunt> dept = load 'dept.txt' using PigStorage('|');




grunt> joindata = join emp by $0, dept by $1;
grunt> dump joindata;
Here $0 refers to the first column of the emp relation and $1 refers to the second column of the dept relation.




Left Join ->
Left, right, and full outer joins are not executed unless the relations have a well-defined schema. A left join on relations that were loaded without a schema will not work.

The following left join works because both relations declare a schema.
grunt> emp = load 'emp.txt' using PigStorage('|') as (eid:int,name:chararray,sal:bytearray);
grunt> dept = load 'dept.txt' using PigStorage('|') as (did:int,eid:int,name:chararray);
grunt> ljoin = join emp by eid left,dept by eid;
grunt> dump ljoin;



Right Join:
grunt> rjoin = join emp by eid right,dept by eid;
grunt> dump rjoin;




STORE:

grunt> store rjoin into '/user/hadoop/rjoin_putput';



[hadoop@localhost blog]$ hadoop fs -cat /user/hadoop/rjoin_putput/part-r-00000


UNION:
Union combines the rows of two relations with similar schemas into one relation.
Union_data = union emp,dept;

CROSS:
Computes the cross product of two relations: every row of the first relation is paired with every row of the second.
grunt> cross_data = cross emp,dept;
This pairs every emp row with every dept row.
grunt> dump cross_data;


LIMIT:
Limits the output to a given number of rows.
grunt> limit_data = limit cross_data 3;
grunt> dump limit_data;
This limits the output to 3 rows.
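CROSS followed by LIMIT can be pictured as a Cartesian product from which only the first few rows are kept. The Python sketch below uses itertools for both steps; the emp and dept rows are hypothetical stand-ins for the files used above. Note that in Pig, a LIMIT without a preceding ORDER BY may return any 3 rows, not necessarily the first 3 shown here.

```python
from itertools import islice, product

# Hypothetical rows standing in for emp.txt and dept.txt.
emp = [(1, "asha"), (2, "ravi")]
dept = [(10, "sales"), (20, "ops"), (30, "hr")]

# CROSS: every emp row paired with every dept row (2 x 3 = 6 rows).
cross_data = [e + d for e, d in product(emp, dept)]

# LIMIT 3: keep only three of those rows.
limit_data = list(islice(cross_data, 3))

print(len(cross_data))
for row in limit_data:
    print(row)
```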




Flatten, AggFunctions, Distinct, Cogroup, Tokenizer
