1. A SQL-like scripting language for Hadoop
CIS 210 – February 2013
Highline Community College
2. Apache Pig is a platform for analyzing large data sets that
consists of a high-level language for expressing data
analysis programs, coupled with infrastructure for
evaluating these programs. The salient property of Pig
programs is that their structure is amenable to substantial
parallelization, which in turn enables them to handle very
large data sets.
3. At the present time, Pig's infrastructure layer consists of a
compiler that produces sequences of Map-Reduce programs,
for which large-scale parallel implementations already exist
(e.g., the Hadoop subproject). Pig's language layer currently
consists of a textual language called Pig Latin, which has the
following key properties:
Ease of programming. It is trivial to achieve parallel
execution of simple, "embarrassingly parallel" data analysis
tasks. Complex tasks comprised of multiple interrelated
data transformations are explicitly encoded as data flow
sequences, making them easy to write, understand, and
maintain.
Optimization opportunities. The way in which tasks are
encoded permits the system to optimize their execution
automatically, allowing the user to focus on semantics
rather than efficiency.
Extensibility. Users can create their own functions to do
special-purpose processing.
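As a sketch of that extensibility, a user-defined function packaged in a jar can be registered and invoked directly from Pig Latin (the jar name, function name, and relation below are hypothetical):

```
-- register a jar containing a user-defined function (UDF)
REGISTER myudfs.jar;
-- apply the hypothetical UDF to a field of an existing relation
upper_names = FOREACH users GENERATE myudfs.UPPER(name);
```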
4. Amazon Web Services offers Hadoop and supports Pig as
part of the Hadoop infrastructure of “Elastic MapReduce”.
Sample Pig Script:
s3://elasticmapreduce/samples/pig-apache/do-reports2.pig
Sample Dataset:
s3://elasticmapreduce/samples/pig-apache/input
5. Local Mode - To run Pig in local mode, you need access to a
single machine; all files are installed and run using your
local host and file system. Specify local mode using the -x
flag (pig -x local).
Mapreduce Mode - To run Pig in mapreduce mode, you
need access to a Hadoop cluster and HDFS installation.
Mapreduce mode is the default mode; you can, but don't
need to, specify it using the -x flag (pig OR pig -x mapreduce).
6. Interactive Mode
You can run Pig in interactive mode using the Grunt shell. Invoke the Grunt shell using the "pig" command (as shown
below) and then enter your Pig Latin statements and Pig commands interactively at the command line.
Batch Mode
You can run Pig in batch mode using Pig scripts and the "pig" command (in local or hadoop mode).
Example
The Pig Latin statements in the Pig script (id.pig) extract all user IDs from the /etc/passwd file. First, copy the
/etc/passwd file to your local working directory. Next, run the Pig script from the command line (using local or
mapreduce mode). The STORE operator will write the results to a file (id.out).
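The id.pig script described above can be sketched roughly as follows (field positions assume the colon-delimited /etc/passwd format):

```
-- id.pig: extract all user IDs from the passwd file
A = LOAD 'passwd' USING PigStorage(':');  -- split each line on ':'
B = FOREACH A GENERATE $0 AS id;          -- keep only the first field (the user ID)
STORE B INTO 'id.out';                    -- write the results to id.out
```

In local mode this would be run as pig -x local id.pig from the directory containing the copied passwd file.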
7. There are two types of job flows supported with Pig:
interactive and batch.
In interactive mode, a customer can start a job flow and
run Pig scripts interactively directly on the master node.
Typically, this mode is used to do ad hoc data analyses and
for application development.
In batch mode, the Pig script is stored in Amazon S3 and is
referenced at the start of the job flow. Typically, batch mode
is used for repeatable runs such as report generation.
9. --
-- import logs and break into tuples
--
raw_logs =
-- load the weblogs into a sequence of one element tuples
LOAD '$INPUT' USING TextLoader AS (line:chararray);
logs_base =
-- for each weblog string, convert the weblog string into a
-- structure with named fields
FOREACH
raw_logs
GENERATE
FLATTEN (
EXTRACT(
line,
'^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"'
)
)
AS (
remoteAddr: chararray, remoteLogname: chararray, user: chararray, time: chararray,
request: chararray, status: int, bytes_string: chararray, referrer: chararray,
browser: chararray
)
;
10. What is a Tuple?
In mathematics and computer science, a tuple is an ordered list of
elements. In set theory, an (ordered) n-tuple is a sequence (or ordered
list) of n elements, where n is a non-negative integer. There is only one
0-tuple, the empty sequence.
An n-tuple is defined inductively using the construction of an ordered
pair. Tuples are usually written by listing the elements within
parentheses "( )" and separated by commas; for example, (2, 7, 4, 1, 7)
denotes a 5-tuple. Sometimes other delimiters are used, such as square
brackets "[ ]" or angle brackets "⟨ ⟩". Braces "{ }" are almost never
used for tuples, as they are the standard notation for sets.
Tuples are often used to describe other mathematical objects, such as
vectors. In computer science, tuples are directly implemented as
product types in most functional programming languages. More
commonly, they are implemented as record types, where the
components are labeled instead of being identified by position alone.
This approach is also used in relational algebra.
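In Pig Latin, every record in a relation is a tuple, and a schema gives its fields names and types. A minimal sketch (the file and field names here are illustrative, not from the sample dataset):

```
-- each row of the input becomes a 3-tuple with named, typed fields
points = LOAD 'points.txt' USING PigStorage(',')
    AS (x:int, y:int, label:chararray);
-- fields can be referenced by name or by position ($0, $1, $2)
labels = FOREACH points GENERATE label;
```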
11. This is a regular expression:
'^(\S+) (\S+) (\S+) \[([\w:/]+\s[+-]\d{4})\] "(.+?)" (\S+) (\S+) "([^"]*)" "([^"]*)"'
Regular expressions can be used to parse data out of a file,
or used to validate data in SQL or other programming
languages. We will focus on SQL because Pig is very similar
to SQL.
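As a small sketch of regular expressions in Pig itself, the matches operator tests a field against a pattern (the relation and field names here are hypothetical; note the doubled backslashes required inside Pig strings):

```
-- keep only rows whose status field is a three-digit 4xx code
client_errors = FILTER responses BY status_string matches '4\\d\\d';
```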
12. This is a little hard to read because of the wrapping. What you
should see is that Pig is loading the line into a tuple with just a
single element --- the line itself. You now need to split the line
into fields. To do this, use the EXTRACT Piggybank function,
which applies a regular expression to the input and extracts the
matched groups as elements of a tuple. The regular expression
is a little tricky because the Apache log defines a couple of
fields with quotes.
Unfortunately, you can't use this as is, because in Pig strings all
backslashes must be escaped with a backslash. This makes the
regular expression a little bulky compared to its use in other
programming languages.
'^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"'
13. logs_base =
-- for each weblog string, convert the weblog string into a
-- structure with named fields
FOREACH
raw_logs
GENERATE
FLATTEN (
EXTRACT(
line,
'^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"'
)
)
AS (
remoteAddr: chararray, remoteLogname: chararray, user: chararray, time: chararray,
request: chararray, status: int, bytes_string: chararray, referrer: chararray,
browser: chararray
)
;
14. logs =
-- convert from string values to typed values such as date_time and integers
FOREACH
logs_base
GENERATE
*,
DATE_TIME(time, 'dd/MMM/yyyy:HH:mm:ss Z', 'UTC') as datetime,
(int)REPLACE(bytes_string, '-', '0') as bytes
;
15. --
-- determine total number of requests and bytes served by UTC hour of day
-- aggregating as a typical day across the total time of the logs
--
by_hour_count =
-- group logs by their hour of day, counting the number of logs in that hour
-- and the sum of the bytes of rows for that hour
FOREACH
(GROUP logs BY FORMAT_DT('HH',datetime))
GENERATE
$0,
COUNT($1) AS num_requests,
SUM($1.bytes) AS num_bytes
;
STORE by_hour_count INTO '$OUTPUT/total_requests_bytes_per_hour';
16. --
-- top 50 X.X.X.* blocks
--
by_ip_count =
-- group weblog entries by the ip address from the remote address field
-- and count the number of entries for each address as well as
-- the sum of the bytes
FOREACH
(GROUP logs BY FORMAT('%s.*', EXTRACT(remoteAddr, '(\\d+\\.\\d+\\.\\d+)')))
GENERATE
$0,
COUNT($1) AS num_requests,
SUM($1.bytes) AS num_bytes
;
by_ip_count_sorted =
-- order ip by the number of requests they make
LIMIT (ORDER by_ip_count BY num_requests DESC) 50;
STORE by_ip_count_sorted into '$OUTPUT/top_50_ips';
17. -- top 50 external referrers
--
by_referrer_count =
-- group by the referrer URL and count the number of requests
FOREACH
(GROUP logs BY EXTRACT(referrer, '(http://[a-z0-9.-]+)'))
GENERATE
FLATTEN($0),
COUNT($1) AS num_requests
;
by_referrer_count_filtered =
-- exclude matches for example.org
FILTER by_referrer_count BY NOT $0 matches '.*example.org';
by_referrer_count_sorted =
-- take the top 50 results
LIMIT (ORDER by_referrer_count_filtered BY num_requests DESC) 50;
STORE by_referrer_count_sorted INTO '$OUTPUT/top_50_external_referrers';
18. -- top search terms coming from bing or google
--
google_and_bing_urls =
-- find referrer fields that match either bing or google
FILTER
(FOREACH logs GENERATE referrer)
BY
referrer matches '.*bing.*'
OR
referrer matches '.*google.*'
;
search_terms =
-- extract from each referrer url the search phrases
FOREACH
google_and_bing_urls
GENERATE
FLATTEN(EXTRACT(referrer, '.*[&?]q=([^&]+).*')) as (term:chararray)
;
search_terms_filtered =
-- reject urls that contained no search terms
FILTER search_terms BY NOT $0 IS NULL;
search_terms_count =
-- for each search phrase count the number of weblogs entries that contained it
FOREACH
(GROUP search_terms_filtered BY $0)
GENERATE
$0,
COUNT($1) AS num
;
search_terms_count_sorted =
-- take the top 50 results
LIMIT (ORDER search_terms_count BY num DESC) 50;
STORE search_terms_count_sorted INTO '$OUTPUT/top_50_search_terms_from_bing_google';
19. (GROUP logs BY EXTRACT(referrer, '(http://[a-z0-9.-]+)'))
(GROUP logs BY FORMAT('%s.*', EXTRACT(remoteAddr, '(\\d+\\.\\d+\\.\\d+)')))
FLATTEN(EXTRACT(referrer, '.*[&?]q=([^&]+).*')) as (term:chararray)
Learning regular expressions will help you with scripting.