2016 ~ SsaiK

Monday, November 28, 2016

Aggregation functions in PIG

Example 1:

year	price
2015	60
2014	45
2016	34
2014	75
2015	45
2014	41

I would like to get the total price of the items for each year.

grunt> file = load '/tmp/yearprice.txt' using PigStorage(',') as (year:int,price:float);

grunt> data = filter file by year > 1;

grunt> grp = group data by year;

grunt> total = foreach grp generate group as year, SUM(data.price);

grunt> dump total;

2014 161
2015 105
2016 34
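
To sanity-check the totals above outside Pig, the same group-and-sum can be sketched in plain Python. This is only an illustrative model of what GROUP followed by SUM computes, not how Pig executes it; the `rows` list and the `sum_by_year` helper are names assumed for this example.

```python
from collections import defaultdict

# (year, price) rows copied from the sample data in Example 1
rows = [(2015, 60.0), (2014, 45.0), (2016, 34.0),
        (2014, 75.0), (2015, 45.0), (2014, 41.0)]

def sum_by_year(rows):
    """Group rows by year and add up the prices (GROUP ... BY year, then SUM)."""
    totals = defaultdict(float)
    for year, price in rows:
        totals[year] += price
    # sort by year only to make the printed result easy to compare
    return dict(sorted(totals.items()))

totals = sum_by_year(rows)
for year, total in totals.items():
    print(year, total)
```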

Example 2:

id1,1,on,400
id1,2,off,100
id2,3,on,200

I would like to get the sum of $3 for the rows where $2 is 'on', grouped by the ID in $0.

grunt> file = load '/tmp/file' using PigStorage(',');

grunt> refineData = foreach file generate $0, $1, (($2 == 'on') ? (int)$3 : 0);

grunt> grp = group refineData by $0;

grunt> sum = foreach grp generate group as id, SUM(refineData.$2);

grunt> dump sum;

id1, 400
id2, 200
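
The conditional sum in Example 2 can be checked with a small Python model. It is a sketch under the assumption that the three sample rows are the whole file; `records` and `sum_when_on` are illustrative names, not part of Pig. Rows whose status is not 'on' contribute 0, so only the 400 row counts for id1.

```python
from collections import defaultdict

# (id, sequence, status, amount) rows from the sample file in Example 2
records = [("id1", 1, "on", 400), ("id1", 2, "off", 100), ("id2", 3, "on", 200)]

def sum_when_on(records):
    """Sum the amount per id, counting a row only when its status is 'on'."""
    totals = defaultdict(int)
    for rec_id, _seq, status, amount in records:
        # mirrors the bincond ($2 == 'on') ? $3 : 0
        totals[rec_id] += amount if status == "on" else 0
    return dict(totals)

on_totals = sum_when_on(records)
print(on_totals)
```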


Wednesday, November 16, 2016

PIG left, right outer joins error

5:09 PM ssaikgame

Pig can perform inner joins as well as left and right outer joins. Sometimes you will hit an error with a left or right outer join even though the same inner join works. The cause is a missing schema.

An inner join works even when the relations have no schema, but left and right outer joins do not: Pig will not execute them unless the relation that may need to be padded with nulls has a well-defined schema.

Left Join: a left join on relations loaded without a schema will not work, but the following left join will, because both relations declare one.

grunt> emp = load 'emp.txt' using PigStorage('|') as (eid:int,name:chararray,sal:bytearray);

grunt> dept = load 'dept.txt' using PigStorage('|') as (did:int,eid:int,name:chararray);

grunt> ljoin = join emp by eid left,dept by eid;

grunt> dump ljoin;

Right Join:

grunt> rjoin = join emp by eid right,dept by eid;

grunt> dump rjoin;
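
A plain-Python model may help show why the schema matters. In an outer join, rows without a match are padded with nulls, and the number of nulls is the width of the other relation, which has to be known in advance. The sketch below is an illustration only: the sample rows, the helper names, and the explicit `right_width` / `left_width` parameters are assumptions made for the example.

```python
def left_outer_join(left, right, left_key, right_key, right_width):
    """Keep every left row; pad with None when no right row shares the key."""
    joined = []
    for l_row in left:
        matches = [r for r in right if r[right_key] == l_row[left_key]]
        if matches:
            joined.extend(l_row + r for r in matches)
        else:
            # the padding width is what a schema would tell Pig
            joined.append(l_row + (None,) * right_width)
    return joined

def right_outer_join(left, right, left_key, right_key, left_width):
    """Keep every right row; pad with None when no left row shares the key."""
    joined = []
    for r_row in right:
        matches = [l for l in left if l[left_key] == r_row[right_key]]
        if matches:
            joined.extend(l + r_row for l in matches)
        else:
            joined.append((None,) * left_width + r_row)
    return joined

# hypothetical rows: emp is (eid, name, sal), dept is (did, eid, name)
emp = [(1, "anu", "4000"), (2, "ravi", "5000")]
dept = [(10, 1, "sales"), (20, 3, "hr")]

ljoin = left_outer_join(emp, dept, 0, 1, right_width=3)
rjoin = right_outer_join(emp, dept, 0, 1, left_width=3)
print(ljoin)
print(rjoin)
```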


Monday, June 20, 2016

What is HTTP Tunneling - Connections Through Restrictions

3:34 PM ssaikgame

HTTP tunneling is used to bypass firewalls and other network restrictions: an HTTP tunnel creates a direct network link between two locations.

But before proceeding further, let us look into the meaning of some of the terms we will use in this context.

Commonly Used Terms

HTTP

HTTP, or Hypertext Transfer Protocol, is the network protocol used by web browsers to communicate with web servers. HTTP defines how messages are formatted and transmitted, and what actions web servers and browsers should take in response to various commands. For example, when you enter a URL in the browser, an HTTP command is sent to the web server directing it to fetch and transmit the requested web page. HTTP is called a stateless protocol because each command is executed independently, without any knowledge of the preceding commands.

Tunneling

Tunneling, sometimes loosely called “port forwarding,” is the method of transmitting private network data and protocol information through a public network by encapsulating the data. Instead of sending a packet directly through the network, the data is sent inside another connection, which is often encrypted. Tunneling can be achieved with nearly all protocols. A tunnel is used to ship a foreign protocol across a network that normally wouldn’t support it: you take protocol A and wrap it in protocol B.

Tunneling provides the basic underlying structure for setting up a VPN, or Virtual Private Network. A VPN is a network constructed by using public connections, usually the Internet, to connect to a private network, such as a company’s internal network. The process uses encryption and transmission protocols to create secure virtual tunnels for data transmission. This helps ensure that only authorized users can access the network and that intercepted data cannot be read. Data is transmitted in the form of packets over the Internet. The data carried in a packet is called the payload, while the packet header contains the routing information required to transmit the packet to a remote destination.

In VPN connection, a tunnel provides a secure medium for data exchanged between the corporate intranet, remote users, and networks of branch offices, suppliers, and business partners. The creation of a tunnel requires the following:

Carrier protocol: This is the network transport protocol to be used as the carrier protocol, for example PPP.
Encapsulation protocol: This protocol will encapsulate the payload of a data packet.
Passenger protocol: Refers to the protocol used by the data packets that are being transmitted through the tunnel. NetBEUI is an example of a passenger protocol.

What is HTTP Tunneling?

HTTP tunneling is the process in which communications are encapsulated by using the HTTP protocol. An HTTP tunnel is often used for network locations that have restricted connectivity or are behind firewalls or proxy servers. A firewall is typically a computer and software that sits between a group of client users and the wider outside Internet or intranet. The firewall is used to protect the internal client network from unauthorized access from outside the firewall.

Certain networks may have restricted connectivity in the form of blocked TCP/IP ports. Traffic from outside the network is restricted to secure it from internal and external threats, and most network protocols are blocked except a few that are used for secured communication. If a user in a corporate environment, for example, has no permission to open TCP connections, the user cannot use certain services or connect to the Internet.

In such a case, HTTP tunneling is a possible solution: the blocked protocol is encapsulated inside HTTP requests that can pass through the firewall or HTTP proxy. The HTTP protocol acts as a wrapper for a channel that the tunneled network protocol uses to communicate.

HTTP tunnel software is used for this purpose. It consists of client-server HTTP tunneling applications that integrate with existing application software and allow it to communicate despite restricted network connectivity. The application plays the role of a tunneling client. HTTP tunnel clients are used to access applications from behind restrictive firewalls or proxy servers, to access blocked sites, or to share confidential resources over HTTP securely.

How to Implement HTTP Tunneling?

When an HTTP connection is made through a proxy server, the client, which is usually the browser, sends the request to the proxy. The proxy opens the connection to the destination, sends the request, receives the response and sends it back to the client. The HTTP protocol specifies a request method called CONNECT. The client can use the CONNECT method to tell the proxy server that a connection to some host on some port is required. The proxy server tries to connect to the destination address specified in the request. If the operation fails, it sends a negative HTTP response back to the client and closes the connection. If the operation succeeds, it sends back a positive HTTP response and the connection is established. After that, the proxy server forwards all data in both directions between the client that requested the connection and the destination. It acts as the tunnel for this communication.
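
The CONNECT handshake described above can be sketched in a few lines of Python. This is a minimal illustration that only builds the request bytes and inspects the proxy's status line; it opens no sockets, and the helper names and the example host are assumptions made for the sketch.

```python
def build_connect_request(host, port):
    """Return the bytes a client sends to ask a proxy for a tunnel to host:port."""
    target = f"{host}:{port}"
    # the two empty strings produce the blank line that ends the header block
    lines = [f"CONNECT {target} HTTP/1.1", f"Host: {target}", "", ""]
    return "\r\n".join(lines).encode("ascii")

def tunnel_established(response_head):
    """Return True if the proxy's status line carries a 2xx (positive) code."""
    status_line = response_head.split(b"\r\n", 1)[0].decode("ascii", "replace")
    parts = status_line.split(" ", 2)
    return len(parts) >= 2 and parts[1].isdigit() and parts[1].startswith("2")

request = build_connect_request("example.com", 443)
print(request.decode("ascii"))
print(tunnel_established(b"HTTP/1.1 200 Connection established\r\n\r\n"))
print(tunnel_established(b"HTTP/1.1 403 Forbidden\r\n\r\n"))
```

After a positive response, a real client would simply start sending the tunneled protocol's bytes over the same connection.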

In some networks, the use of the CONNECT method is restricted to a few trusted sites. In such cases, an HTTP tunnel can still be implemented using only the usual HTTP methods such as POST, GET, PUT and DELETE. Here the tunnel server runs outside the protected network and acts as a special HTTP server, while the client program runs on a computer inside the protected network. Whenever network traffic is passed to the client, it repackages the traffic as an HTTP request and relays it to the outside server, which extracts and executes the original network request on the client's behalf. The response to that request is repackaged as an HTTP response and relayed back to the client. Since all traffic is encapsulated inside normal GET and POST requests and responses, this approach works through most proxies and firewalls.
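
The repackaging step can be illustrated with a small Python sketch that wraps arbitrary bytes in a POST request and unwraps them again. The `/tunnel` path, the `tunnel.example.com` host, and the choice of base64 for the body are all assumptions made for this illustration; real tunneling tools define their own framing.

```python
import base64

def wrap_in_post(payload, path="/tunnel", host="tunnel.example.com"):
    """Encapsulate raw protocol bytes as the body of an HTTP POST request."""
    body = base64.b64encode(payload)  # keeps the body free of CR/LF sequences
    head = (
        f"POST {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Content-Type: text/plain\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("ascii")
    return head + body

def unwrap_post(request):
    """Recover the original bytes on the tunnel server side."""
    _head, _sep, body = request.partition(b"\r\n\r\n")
    return base64.b64decode(body)

original = b"\x00\x01SSH-2.0-demo\r\n"
wrapped = wrap_in_post(original)
print(wrapped.split(b"\r\n", 1)[0].decode("ascii"))
print(unwrap_post(wrapped) == original)
```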

Conclusion

Next time you want to use your Internet applications safely despite restrictive firewalls, and want an extra layer of protection against hackers, spyware, and ID theft, an HTTP tunnel may be the right option for you.

Source: Udemy Blog


Friday, June 17, 2016

Advanced scripts - PIG Part 3

9:53 PM ssaikgame

FILTER:
filtered = FILTER climate BY country MATCHES 'C.*a'; -> China
filtered = FILTER climate BY country MATCHES '.*(ndi|hin).*'; -> India, China
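
MATCHES succeeds only when the whole string matches the pattern, and it is case-sensitive. A rough Python model of that behaviour uses `re.fullmatch`; the `pig_matches` helper and the country list are assumptions for illustration, and Python's regex dialect differs from Java's in some details.

```python
import re

def pig_matches(value, pattern):
    """Model of MATCHES: the entire value must match the pattern."""
    return re.fullmatch(pattern, value) is not None

# hypothetical country values
countries = ["China", "India", "Chile", "Japan"]

starts_c_ends_a = [c for c in countries if pig_matches(c, "C.*a")]
contains_ndi_or_hin = [c for c in countries if pig_matches(c, ".*(ndi|hin).*")]
print(starts_c_ends_a)
print(contains_ndi_or_hin)
```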

SPLIT:
SPLIT climate INTO B1992 IF $1 == 1992, B2002 IF $1 == 2002;
It splits the climate relation into two relations, B1992 and B2002, which contain the 1992 and 2002 data respectively.
DUMP B1992;
DUMP B2002;
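
In plain Python, the same split amounts to one filter per output relation. The sketch below uses hypothetical (country, year, temp) rows; note that a row matching neither condition ends up in neither output.

```python
def split_by_year(rows, year_index=1):
    """Return the rows for 1992 and for 2002 as two separate lists."""
    b1992 = [r for r in rows if r[year_index] == 1992]
    b2002 = [r for r in rows if r[year_index] == 2002]
    return b1992, b2002

# hypothetical climate rows: (country, year, temp)
climate = [("India", 1992, 31.5), ("China", 2002, 18.0), ("India", 2010, 33.1)]

b1992, b2002 = split_by_year(climate)
print(b1992)
print(b2002)
```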

SAMPLE:
It returns a random sample of the rows in the relation.
sampled = SAMPLE climate 0.1; -> returns roughly 10% of the rows from climate, chosen at random.
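
The sample size is a fraction between 0 and 1, so 0.1 means about 10% of the rows, and the exact count varies from run to run. A minimal Python sketch of per-row random sampling, with an optional `seed` parameter added here only to make runs repeatable:

```python
import random

def sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < fraction]

rows = list(range(1000))
picked = sample(rows, 0.1, seed=42)
print(len(picked))  # close to 100, but not exactly 100 in general
```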

ORDER:
ordered = ORDER climate BY year ASC;

GROUP BY:
grouped = GROUP climate BY year;
grouped = GROUP climate BY (year, temp);

COGROUP:
It groups the tuples of two or more relations by a common grouping column, producing one bag per relation for each key.
cogrouped = COGROUP a BY a1, b BY b1;
DUMP cogrouped;

Relation a (a1, a2, a3)    Relation b (b1, b2)
1, 2, 3                    4, 5
3, 5, 3                    1, 3
2, 2, 7                    5, 5
1, 4, 3                    2, 5
2, 2, 3                    1, 8

All key values from a1 and b1: (1,3,2,1,2,4,1,5,2,1) --> distinct group keys: (1,3,2,4,5)

(1,{(1,2,3),(1,4,3)},{(1,3),(1,8)})
(3,{(3,5,3)},{})
(2,{(2,2,7),(2,2,3)},{(2,5)})
(4,{},{(4,5)})
(5,{},{(5,5)})

cogroup_a_inner = COGROUP a BY a1 INNER, b BY b1; --> at least one tuple must be present in a.
(1,{(1,2,3),(1,4,3)},{(1,3),(1,8)})
(3,{(3,5,3)},{})
(2,{(2,2,7),(2,2,3)},{(2,5)})

cogroup_b_inner = COGROUP a BY a1, b BY b1 INNER; --> at least one tuple must be present in b.
(1,{(1,2,3),(1,4,3)},{(1,3),(1,8)})
(2,{(2,2,7),(2,2,3)},{(2,5)})
(4,{},{(4,5)})
(5,{},{(5,5)})

cogroup_a_b_inner = COGROUP a BY a1 inner, b BY b1 inner;
(1,{(1,2,3),(1,4,3)},{(1,3),(1,8)})
(2,{(2,2,7),(2,2,3)},{(2,5)})
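
The four COGROUP variants above can be reproduced with a short Python model using the same sample relations. It is a sketch of the semantics only: `cogroup` and its `a_inner` / `b_inner` flags are names assumed for this example, and real Pig output order is not guaranteed.

```python
def cogroup(a, b, a_inner=False, b_inner=False):
    """Group rows of a and b by their first column, one bag per relation."""
    keys = []
    for row in a + b:
        if row[0] not in keys:  # keep first-seen order of distinct keys
            keys.append(row[0])
    result = []
    for key in keys:
        bag_a = [r for r in a if r[0] == key]
        bag_b = [r for r in b if r[0] == key]
        # INNER on a relation drops groups where that relation's bag is empty
        if (a_inner and not bag_a) or (b_inner and not bag_b):
            continue
        result.append((key, bag_a, bag_b))
    return result

a = [(1, 2, 3), (3, 5, 3), (2, 2, 7), (1, 4, 3), (2, 2, 3)]
b = [(4, 5), (1, 3), (5, 5), (2, 5), (1, 8)]

for group in cogroup(a, b):
    print(group)
```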

TUPLE:
Load the following data as tuples.
(1,2,3) (4,5,6)
(2,3,4) (5,6,7)

tuples = LOAD 'abc.txt' USING PigStorage(' ') AS (t1:tuple(t1a:int,t1b:int,t1c:int),t2:tuple(t2a:int,t2b:int,t2c:int));

BAG:
Load the following data as bags.
{(1,2,3),(1,4,3)} {(1,3),(1,8)}
{(3,5,3)} {(3,5)}
{(2,2,7),(2,2,3)} {(2,5)}

bag1 = load 'data.txt' USING PigStorage(' ') AS (B1:Bag{t1:tuple(t1a:int,t1b:int,t1c:int)}, B2:Bag{t2:tuple(t2a:int,t2b:int)});


PIG Transformations or Operators - Part 2

2:51 AM ssaikgame

LOAD:

Load the data into the relation “base”:

grunt> base = load 'filepig.txt' using PigStorage('|') as (empid:int,ename:chararray,salary:int);

PigStorage('|') splits each line into columns using the delimiter '|'.

filepig.txt contains 3 columns separated by the delimiter “|”. The as (empid:int,ename:chararray,salary:int) clause names and types each column, so empid is the 1st column, ename is the 2nd column, and salary is the 3rd column.

The line above loads the data from filepig.txt into the relation base.

We can check the schema of the relation by using describe base;

foreach and generate: iterates over each row of a relation and generates the listed columns or expressions.

grunt> eid = foreach base generate empid;

DUMP: dump runs the statements and prints the contents of a relation to the console.

grunt> dump base;

FILTER: filter is used to filter data, for example to keep only the employees whose salary is more than 4000.

grunt> moresal = filter base by salary > 4000;

grunt> dump moresal;

ILLUSTRATE: shows, step by step, how sample rows flow through the statements, together with the schema at each step.

grunt> illustrate base;

EXPLAIN:

Used to display the logical plan, the physical plan, and the MapReduce plan that Pig builds for a relation.

grunt> explain emp;

2016-06-17 02:35:55,746 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS

2016-06-17 02:35:55,752 [main] WARN org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized

2016-06-17 02:35:55,752 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]}

#-----------------------------------------------

# New Logical Plan:

#-----------------------------------------------

emp: (Name: LOStore Schema: eid#10443:int,name#10444:chararray,sal#10445:chararray)

|---emp: (Name: LOForEach Schema: eid#10443:int,name#10444:chararray,sal#10445:chararray)

| |

| (Name: LOGenerate[false,false,false] Schema: eid#10443:int,name#10444:chararray,sal#10445:chararray)ColumnPrune:OutputUids=[10443, 10444, 10445]ColumnPrune:InputUids=[10443, 10444, 10445]

| | |

| | (Name: Cast Type: int Uid: 10443)

| | |

| | |---eid:(Name: Project Type: bytearray Uid: 10443 Input: 0 Column: (*))

| | |

| | (Name: Cast Type: chararray Uid: 10444)

| | |

| | |---name:(Name: Project Type: bytearray Uid: 10444 Input: 1 Column: (*))

| | |

| | (Name: Cast Type: chararray Uid: 10445)

| | |

| | |---sal:(Name: Project Type: bytearray Uid: 10445 Input: 2 Column: (*))

| |

| |---(Name: LOInnerLoad[0] Schema: eid#10443:bytearray)

| |

| |---(Name: LOInnerLoad[1] Schema: name#10444:bytearray)

| |

| |---(Name: LOInnerLoad[2] Schema: sal#10445:bytearray)

|---emp: (Name: LOLoad Schema: eid#10443:bytearray,name#10444:bytearray,sal#10445:bytearray)RequiredFields:null

#-----------------------------------------------

# Physical Plan:

#-----------------------------------------------

emp: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-621

|---emp: New For Each(false,false,false)[bag] - scope-620

| |

| Cast[int] - scope-612

| |

| |---Project[bytearray][0] - scope-611

| |

| Cast[chararray] - scope-615

| |

| |---Project[bytearray][1] - scope-614

| |

| Cast[chararray] - scope-618

| |

| |---Project[bytearray][2] - scope-617

|---emp: Load(hdfs://localhost:9000/user/hadoop/emp.txt:PigStorage('|')) - scope-610

2016-06-17 02:35:55,757 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false

2016-06-17 02:35:55,759 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1

2016-06-17 02:35:55,759 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1

#--------------------------------------------------

# Map Reduce Plan

#--------------------------------------------------

MapReduce node scope-622

Map Plan

emp: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-621

|---emp: New For Each(false,false,false)[bag] - scope-620

| |

| Cast[int] - scope-612

| |

| |---Project[bytearray][0] - scope-611

| |

| Cast[chararray] - scope-615

| |

| |---Project[bytearray][1] - scope-614

| |

| Cast[chararray] - scope-618

| |

| |---Project[bytearray][2] - scope-617

|---emp: Load(hdfs://localhost:9000/user/hadoop/emp.txt:PigStorage('|')) - scope-610--------

Global sort: false

----------------

Order By:

grunt> ordersalary = order base by salary;

grunt> dump ordersalary;

Group BY:

grunt> group_data = group emp by eid;

grunt> dump group_data;

grunt> describe group_data;

JOIN:

Let’s take another table, stored as dept.txt, and rename the old file to emp.txt to avoid confusion.

Copy the dept and emp files to HDFS.

$ hadoop fs -put dept.txt

$ hadoop fs -put emp.txt

grunt> emp = load 'emp.txt' using PigStorage('|');

grunt> dept = load 'dept.txt' using PigStorage('|');

grunt> joindata = join emp by $0, dept by $1;

grunt> dump joindata;

Here $0 refers to the first column of the emp relation and $1 refers to the second column of the dept relation.
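
A positional inner join like this one can be modelled in Python by comparing the chosen column of every emp row with the chosen column of every dept row. The rows below are hypothetical, and the values are strings because nothing has been cast when no schema is declared.

```python
def inner_join(left, right, left_col, right_col):
    """Pair every left row with every right row that has the same key value."""
    return [l + r for l in left for r in right if l[left_col] == r[right_col]]

# hypothetical rows: emp is (eid, name, sal), dept is (did, eid, name)
emp = [("1", "anu", "4000"), ("2", "ravi", "5000")]
dept = [("10", "1", "sales"), ("20", "3", "hr")]

# join emp by $0, dept by $1
joindata = inner_join(emp, dept, 0, 1)
print(joindata)
```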

Left Join:

Left and right outer joins will not be executed without a well-defined schema. A left join on the schema-less relations loaded above will not work, but the following left join will, because both relations declare a schema.

grunt> emp = load 'emp.txt' using PigStorage('|') as (eid:int,name:chararray,sal:bytearray);

grunt> dept = load 'dept.txt' using PigStorage('|') as (did:int,eid:int,name:chararray);

grunt> ljoin = join emp by eid left,dept by eid;

grunt> dump ljoin;

Right Join:

grunt> rjoin = join emp by eid right,dept by eid;

grunt> dump rjoin;

STORE:

grunt> store rjoin into '/user/hadoop/rjoin_putput';

[hadoop@localhost blog]$ hadoop fs -cat /user/hadoop/rjoin_putput/part-r-00000

UNION:

Union is used to combine the rows of two relations; the relations should have similar schemas.

grunt> union_data = union emp, dept;

CROSS:

Is used to compute the cross product of two relations: every row of the first relation is paired with every row of the second.

grunt> cross_data = cross emp,dept;

It pairs each emp row with each dept row.

grunt> dump cross_data;
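
The size of a cross product grows quickly, which a tiny Python model makes visible: with 2 emp rows and 3 dept rows the result has 2 x 3 = 6 rows. The rows and the `cross` helper are assumptions for illustration.

```python
from itertools import product

def cross(left, right):
    """Pair every left row with every right row."""
    return [l + r for l, r in product(left, right)]

# hypothetical rows
emp = [(1, "anu"), (2, "ravi")]
dept = [(10, "sales"), (20, "hr"), (30, "ops")]

cross_data = cross(emp, dept)
print(len(cross_data))
print(cross_data[0])
```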

LIMIT:

Is used to limit the output to a given number of rows.

grunt> limit_data = limit cross_data 3;

grunt> dump limit_data;

It limits the output to 3 rows.

Flatten, AggFunctions, Distinct, Cogroup, Tokenizer

