A SQL-like scripting language for Hadoop

   CIS 210 – February 2013
 Highline Community College
Apache Pig is a platform for analyzing large data sets that
consists of a high-level language for expressing data
analysis programs, coupled with infrastructure for
evaluating these programs. The salient property of Pig
programs is that their structure is amenable to substantial
parallelization, which in turn enables them to handle very
large data sets.
At the present time, Pig's infrastructure layer consists of a
compiler that produces sequences of Map-Reduce programs,
for which large-scale parallel implementations already exist
(e.g., the Hadoop subproject). Pig's language layer currently
consists of a textual language called Pig Latin, which has the
following key properties:
    Ease of programming. It is trivial to achieve parallel
      execution of simple, "embarrassingly parallel" data analysis
      tasks. Complex tasks comprised of multiple interrelated
      data transformations are explicitly encoded as data flow
      sequences, making them easy to write, understand, and
      maintain.
    Optimization opportunities. The way in which tasks are
      encoded permits the system to optimize their execution
      automatically, allowing the user to focus on semantics
      rather than efficiency.
    Extensibility. Users can create their own functions to do
      special-purpose processing.
Amazon Web Services offers Hadoop and supports Pig as
part of the Hadoop infrastructure of “Elastic MapReduce”.

Sample Pig Script:
s3://elasticmapreduce/samples/pig-apache/do-
reports2.pig

Sample Dataset:
s3://elasticmapreduce/samples/pig-apache/input
Local Mode - To run Pig in local mode, you need access to a
single machine; all files are installed and run using your
local host and file system. Specify local mode using the -x
flag (pig -x local).

Mapreduce Mode - To run Pig in mapreduce mode, you
need access to a Hadoop cluster and HDFS installation.
Mapreduce mode is the default mode; you can, but don't
need to, specify it using the -x flag (pig or pig -x mapreduce).
Interactive Mode
You can run Pig in interactive mode using the Grunt shell. Invoke the Grunt shell using the "pig" command (as shown
below) and then enter your Pig Latin statements and Pig commands interactively at the command line.

Batch Mode
You can run Pig in batch mode using Pig scripts and the "pig" command (in local or hadoop mode).
Example
The Pig Latin statements in the Pig script (id.pig) extract all user IDs from the /etc/passwd file. First, copy the
/etc/passwd file to your local working directory. Next, run the Pig script from the command line (using local or
mapreduce mode). The STORE operator will write the results to a file (id.out).
There are two types of job flows supported with Pig:
interactive and batch.

In interactive mode, a customer can start a job flow and
run Pig scripts interactively directly on the master node.
Typically, this mode is used to do ad hoc data analyses and
for application development.

In batch mode, the Pig script is stored in Amazon S3 and is
referenced at the start of the job flow. Typically, batch mode
is used for repeatable runs such as report generation.
--
-- setup piggyback functions
--
register file:/home/hadoop/lib/pig/piggybank.jar
DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT();
DEFINE FORMAT org.apache.pig.piggybank.evaluation.string.FORMAT();
DEFINE REPLACE org.apache.pig.piggybank.evaluation.string.REPLACE();
DEFINE DATE_TIME org.apache.pig.piggybank.evaluation.datetime.DATE_TIME();
DEFINE FORMAT_DT org.apache.pig.piggybank.evaluation.datetime.FORMAT_DT();
--
-- import logs and break into tuples
--
raw_logs =
 -- load the weblogs into a sequence of one element tuples
 LOAD '$INPUT' USING TextLoader AS (line:chararray);

logs_base =
 -- for each weblog string convert the weblog string into a
 -- structure with named fields
 FOREACH raw_logs
 GENERATE
   FLATTEN(
     EXTRACT(
       line,
       '^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"'
     )
   )
   AS (
     remoteAddr: chararray, remoteLogname: chararray, user: chararray, time: chararray,
     request: chararray, status: int, bytes_string: chararray, referrer: chararray,
     browser: chararray
   )
 ;
What is a Tuple?

In mathematics and computer science, a tuple is an ordered list of
elements. In set theory, an (ordered) n-tuple is a sequence (or ordered
list) of n elements, where n is a non-negative integer. There is only one
0-tuple, the empty sequence.
An n-tuple is defined inductively using the construction of an ordered
pair. Tuples are usually written by listing the elements within
parentheses "( )" and separated by commas; for example, (a, b, c, d, e)
denotes a 5-tuple. Sometimes other delimiters are used, such as square
brackets "[ ]" or angle brackets "⟨ ⟩". Braces "{ }" are almost never
used for tuples, as they are the standard notation for sets.
Tuples are often used to describe other mathematical objects, such as
vectors. In computer science, tuples are directly implemented as
product types in most functional programming languages. More
commonly, they are implemented as record types, where the
components are labeled instead of being identified by position alone.
This approach is also used in relational algebra.
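In Pig terms, each row produced by EXTRACT is such a tuple, and the AS (...) schema turns it into a labeled record. A quick Python sketch of the two ideas (Python used only for illustration; the field names echo the log schema in this deck):

```python
from collections import namedtuple

# A plain tuple: elements identified by position alone
t = ("192.168.0.1", "-", "frank", 200)

# A record type: components labeled by name, like Pig's AS (...) schema
LogEntry = namedtuple("LogEntry", ["remoteAddr", "user", "status"])
e = LogEntry(remoteAddr="192.168.0.1", user="frank", status=200)

print(t[0])      # 192.168.0.1
print(e.status)  # 200
```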
This is a regular expression:

 '^(\S+) (\S+) (\S+) \[([\w:/]+\s[+-]\d{4})\] "(.+?)" (\S+) (\S+) "([^"]*)" "([^"]*)"'

Regular expressions can be used to parse data out of a file,
or to validate data in SQL or other programming languages.
We will focus on SQL because Pig is very similar to SQL.
This is a little hard to read because of the wrapping. What you
should see is that Pig is loading each line into a tuple with just a
single element: the line itself. You now need to split the line
into fields. To do this, use the EXTRACT Piggybank function,
which applies a regular expression to the input and extracts the
matched groups as elements of a tuple. The regular expression
is a little tricky because the Apache log format defines a couple of
fields with quotes.

Unfortunately, you can't use the regular expression as is, because
in Pig strings all backslashes must be escaped with a backslash.
This makes the regular expression a little bulkier than it would be
in other programming languages:

'^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"'
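The pattern can be sanity-checked outside Pig. Here is the same regex in Python, where a raw string needs only single backslashes (a sketch for illustration; the sample log line is invented):

```python
import re

# Apache combined-log pattern, single-backslash form in a Python raw string
PATTERN = (r'^(\S+) (\S+) (\S+) \[([\w:/]+\s[+-]\d{4})\] '
           r'"(.+?)" (\S+) (\S+) "([^"]*)" "([^"]*)"')

line = ('192.168.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
        '"GET /index.html HTTP/1.0" 200 2326 '
        '"http://example.org/start" "Mozilla/4.08"')

m = re.match(PATTERN, line)
# The nine capture groups become the nine fields of the logs_base schema
(remote_addr, remote_logname, user, time, request,
 status, bytes_string, referrer, browser) = m.groups()
print(status)   # 200
print(request)  # GET /index.html HTTP/1.0
```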
logs_base =
 -- for each weblog string convert the weblog string into a
 -- structure with named fields
 FOREACH raw_logs
 GENERATE
   FLATTEN(
     EXTRACT(
       line,
       '^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"'
     )
   )
   AS (
     remoteAddr: chararray, remoteLogname: chararray, user: chararray, time: chararray,
     request: chararray, status: int, bytes_string: chararray, referrer: chararray,
     browser: chararray
   )
 ;
logs =
 -- convert from string values to typed values such as date_time and integers
 FOREACH logs_base
 GENERATE
   *,
   DATE_TIME(time, 'dd/MMM/yyyy:HH:mm:ss Z', 'UTC') as datetime,
   (int)REPLACE(bytes_string, '-', '0') as bytes
 ;
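What those two expressions do can be sketched in Python (to_typed is an illustrative helper of mine, not part of the Pig script): parse the Apache timestamp into a UTC datetime, and treat the '-' byte count (meaning "no bytes served") as 0 before casting to int.

```python
from datetime import datetime, timezone

def to_typed(time_str, bytes_string):
    # Parse the Apache timestamp, then normalize to UTC
    dt = datetime.strptime(time_str, '%d/%b/%Y:%H:%M:%S %z').astimezone(timezone.utc)
    # A '-' byte count means no bytes were served; treat it as 0
    nbytes = int(bytes_string.replace('-', '0'))
    return dt, nbytes

dt, nbytes = to_typed('10/Oct/2000:13:55:36 -0700', '-')
print(dt.hour, nbytes)  # 20 0
```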
--
-- determine total number of requests and bytes served by UTC hour of day,
-- aggregating as a typical day across the total time of the logs
--
by_hour_count =
 -- group logs by their hour of day, counting the number of logs in that hour
 -- and the sum of the bytes of rows for that hour
 FOREACH
   (GROUP logs BY FORMAT_DT('HH', datetime))
 GENERATE
   $0,
   COUNT($1) AS num_requests,
   SUM($1.bytes) AS num_bytes
 ;

STORE by_hour_count INTO '$OUTPUT/total_requests_bytes_per_hour';
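The GROUP ... GENERATE COUNT/SUM pattern is an ordinary count-and-sum-per-key aggregation. A plain-Python sketch of the same idea (the sample records are invented):

```python
from collections import defaultdict

# (hour, bytes) pairs standing in for the datetime and bytes fields of each log row
records = [(13, 2326), (13, 100), (20, 512)]

totals = defaultdict(lambda: [0, 0])  # hour -> [num_requests, num_bytes]
for hour, nbytes in records:
    totals[hour][0] += 1        # COUNT($1)
    totals[hour][1] += nbytes   # SUM($1.bytes)

print(dict(totals))  # {13: [2, 2426], 20: [1, 512]}
```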
--
-- top 50 X.X.X.* blocks
--
by_ip_count =
  -- group weblog entries by the ip address from the remote address field
  -- and count the number of entries for each address as well as
  -- the sum of the bytes
  FOREACH
    (GROUP logs BY FORMAT('%s.*', EXTRACT(remoteAddr, '(\\d+\\.\\d+\\.\\d+)')))
  GENERATE
    $0,
    COUNT($1) AS num_requests,
    SUM($1.bytes) AS num_bytes
  ;

by_ip_count_sorted =
 -- order ip by the number of requests they make
 LIMIT (ORDER by_ip_count BY num_requests DESC) 50;

STORE by_ip_count_sorted into '$OUTPUT/top_50_ips';
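The EXTRACT/FORMAT pair just turns each address into its X.X.X.* block: keep the first three octets, wildcard the host part. In Python (illustration only; ip_block is my name, not part of the script):

```python
import re

def ip_block(remote_addr):
    # Capture the first three octets, then append the '.*' wildcard
    m = re.match(r'(\d+\.\d+\.\d+)', remote_addr)
    return f'{m.group(1)}.*' if m else None

print(ip_block('192.168.0.1'))  # 192.168.0.*
```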
-- top 50 external referrers
--
by_referrer_count =
 -- group by the referrer URL and count the number of requests
 FOREACH
   (GROUP logs BY EXTRACT(referrer, '(http://[a-z0-9.-]+)'))
 GENERATE
   FLATTEN($0),
   COUNT($1) AS num_requests
 ;

by_referrer_count_filtered =
 -- exclude matches for example.org
 FILTER by_referrer_count BY NOT $0 matches '.*example.org';

by_referrer_count_sorted =
 -- take the top 50 results
 LIMIT (ORDER by_referrer_count_filtered BY num_requests DESC) 50;

STORE by_referrer_count_sorted INTO '$OUTPUT/top_50_external_referrers';
-- top search terms coming from bing or google
--
google_and_bing_urls =
 -- find referrer fields that match either bing or google
 FILTER
   (FOREACH logs GENERATE referrer)
 BY
   referrer matches '.*bing.*'
 OR
   referrer matches '.*google.*'
 ;

search_terms =
 -- extract from each referrer url the search phrases
 FOREACH google_and_bing_urls
 GENERATE
   FLATTEN(EXTRACT(referrer, '.*[&?]q=([^&]+).*')) as (term:chararray)
 ;

search_terms_filtered =
 -- reject urls that contained no search terms
 FILTER search_terms BY NOT $0 IS NULL;

search_terms_count =
 -- for each search phrase count the number of weblogs entries that contained it
 FOREACH
   (GROUP search_terms_filtered BY $0)
 GENERATE
   $0,
   COUNT($1) AS num
 ;

search_terms_count_sorted =
 -- take the top 50 results
 LIMIT (ORDER search_terms_count BY num DESC) 50;


STORE search_terms_count_sorted INTO '$OUTPUT/top_50_search_terms_from_bing_google';
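The q= capture can be sanity-checked in Python (illustration; search_term is my name and the sample URL is invented):

```python
import re

def search_term(referrer):
    # Pull the q= query parameter out of a search-engine referrer URL
    m = re.match(r'.*[&?]q=([^&]+).*', referrer)
    return m.group(1) if m else None

print(search_term('http://www.google.com/search?q=apache+pig&hl=en'))  # apache+pig
print(search_term('http://example.org/no-search'))                     # None
```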
(GROUP logs BY EXTRACT(referrer, '(http://[a-z0-9.-]+)'))

(GROUP logs BY FORMAT('%s.*', EXTRACT(remoteAddr, '(\\d+\\.\\d+\\.\\d+)')))

FLATTEN(EXTRACT(referrer, '.*[&?]q=([^&]+).*')) as (term:chararray)

Learning regular expressions will help you with scripting:
https://www.owasp.org/index.php/Input_Validation_Cheat_Sheet
http://www.regular-expressions.info/
AWS Hadoop and PIG and overview

Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
 
Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)
 
Presentation Activity 2. Unit 3 transv.pptx
Presentation Activity 2. Unit 3 transv.pptxPresentation Activity 2. Unit 3 transv.pptx
Presentation Activity 2. Unit 3 transv.pptx
 
Oppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and FilmOppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and Film
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
 

AWS Hadoop and PIG and overview

  • 1. A SQL-like scripting language for Hadoop. CIS 210 – February 2013, Highline Community College
  • 2. Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.
  • 3. At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin, which has the following key properties:
       Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks composed of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
       Optimization opportunities. The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
       Extensibility. Users can create their own functions to do special-purpose processing.
  • 4. Amazon Web Services offers Hadoop and supports Pig as part of the Hadoop infrastructure of its "Elastic MapReduce" service.
       Sample Pig script: s3://elasticmapreduce/samples/pig-apache/do-reports2.pig
       Sample dataset: s3://elasticmapreduce/samples/pig-apache/input
  • 5. Local Mode - To run Pig in local mode, you need access to a single machine; all files are installed and run using your local host and file system. Specify local mode using the -x flag (pig -x local).
       Mapreduce Mode - To run Pig in mapreduce mode, you need access to a Hadoop cluster and HDFS installation. Mapreduce mode is the default mode; you can, but don't need to, specify it using the -x flag (pig or pig -x mapreduce).
  • 6. Interactive Mode - You can run Pig in interactive mode using the Grunt shell. Invoke the Grunt shell using the "pig" command and then enter your Pig Latin statements and Pig commands interactively at the command line.
       Batch Mode - You can run Pig in batch mode using Pig scripts and the "pig" command (in local or hadoop mode).
       Example - The Pig Latin statements in the Pig script (id.pig) extract all user IDs from the /etc/passwd file. First, copy the /etc/passwd file to your local working directory. Next, run the Pig script from the command line (using local or mapreduce mode). The STORE operator will write the results to a file (id.out).
  • 7. There are two types of job flows supported with Pig: interactive and batch. In interactive mode, a customer can start a job flow and run Pig scripts interactively directly on the master node. Typically, this mode is used to do ad hoc data analyses and for application development. In batch mode, the Pig script is stored in Amazon S3 and is referenced at the start of the job flow. Typically, batch mode is used for repeatable runs such as report generation.
  • 8. --
       -- setup piggybank functions
       --
       register file:/home/hadoop/lib/pig/piggybank.jar
       DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT();
       DEFINE FORMAT org.apache.pig.piggybank.evaluation.string.FORMAT();
       DEFINE REPLACE org.apache.pig.piggybank.evaluation.string.REPLACE();
       DEFINE DATE_TIME org.apache.pig.piggybank.evaluation.datetime.DATE_TIME();
       DEFINE FORMAT_DT org.apache.pig.piggybank.evaluation.datetime.FORMAT_DT();
  • 9. --
       -- import logs and break into tuples
       --
       raw_logs = -- load the weblogs into a sequence of one-element tuples
         LOAD '$INPUT' USING TextLoader AS (line:chararray);
       logs_base = -- for each weblog string convert the weblog string into a
                   -- structure with named fields
         FOREACH raw_logs
         GENERATE FLATTEN (
           EXTRACT(line, '^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"')
         )
         AS (
           remoteAddr: chararray, remoteLogname: chararray, user: chararray,
           time: chararray, request: chararray, status: int,
           bytes_string: chararray, referrer: chararray, browser: chararray
         );
  • 10. What is a Tuple? In mathematics and computer science, a tuple is an ordered list of elements. In set theory, an (ordered) n-tuple is a sequence (or ordered list) of n elements, where n is a non-negative integer. There is only one 0-tuple, an empty sequence. An n-tuple is defined inductively using the construction of an ordered pair. Tuples are usually written by listing the elements within parentheses "( )" and separated by commas; for example, (a, b, c, d, e) denotes a 5-tuple. Sometimes other delimiters are used, such as square brackets "[ ]" or angle brackets "⟨ ⟩". Braces "{ }" are almost never used for tuples, as they are the standard notation for sets. Tuples are often used to describe other mathematical objects, such as vectors. In computer science, tuples are directly implemented as product types in most functional programming languages. More commonly, they are implemented as record types, where the components are labeled instead of being identified by position alone. This approach is also used in relational algebra.
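The same idea appears in most programming languages. As a quick illustration (sketched in Python rather than Pig, since the point is the data structure itself):

```python
# Tuples: ordered, fixed-length sequences, written with parentheses.
point = (3, 4)                      # a 2-tuple (an ordered pair)
five = ('a', 'b', 'c', 'd', 'e')    # a 5-tuple
empty = ()                          # the single 0-tuple

# Elements are identified by position, much like Pig's $0, $1, ...
first = point[0]
```

In Pig, by contrast, the components of a tuple can also carry names (remoteAddr, status, and so on), which is the record-type flavor the slide mentions.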
  • 11. This is a regular expression:
        '^(\S+) (\S+) (\S+) \[([\w:/]+\s[+-]\d{4})\] "(.+?)" (\S+) (\S+) "([^"]*)" "([^"]*)"'
        Regular expressions can be used to parse data out of a file, or to validate data in SQL or other programming languages. We will focus on SQL because Pig is very similar to SQL.
  • 12. This is a little hard to read because of the wrapping. What you should see is that Pig is loading the line into a tuple with just a single element --- the line itself. You now need to split the line into fields. To do this, use the EXTRACT Piggybank function, which applies a regular expression to the input and extracts the matched groups as elements of a tuple. The regular expression is a little tricky because the Apache log defines a couple of fields with quotes. Unfortunately, you can't use it as is, because in Pig strings all backslashes must be escaped with a backslash. This makes the regular expression a little bulkier than it would be in other programming languages.
        '^(\S+) (\S+) (\S+) \[([\w:/]+\s[+-]\d{4})\] "(.+?)" (\S+) (\S+) "([^"]*)" "([^"]*)"'
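To see what the pattern does outside Pig, here is a sketch in Python, where a raw string lets us keep the single backslashes; the log line is a made-up example in Apache's combined format:

```python
import re

# A made-up line in Apache combined log format.
line = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
        '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
        '"http://www.example.com/start.html" "Mozilla/4.08"')

# The slide's pattern, written as a Python raw string (no doubled backslashes).
pattern = (r'^(\S+) (\S+) (\S+) \[([\w:/]+\s[+-]\d{4})\] '
           r'"(.+?)" (\S+) (\S+) "([^"]*)" "([^"]*)"')

# match() yields one group per parenthesized subpattern -- the same nine
# fields the Pig script names remoteAddr through browser.
fields = re.match(pattern, line).groups()
```

Each capture group becomes one element of a tuple, which is exactly what EXTRACT does on the Pig side.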
  • 13. logs_base = -- for each weblog string convert the weblog string into a
                    -- structure with named fields
          FOREACH raw_logs
          GENERATE FLATTEN (
            EXTRACT(line, '^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"')
          )
          AS (
            remoteAddr: chararray, remoteLogname: chararray, user: chararray,
            time: chararray, request: chararray, status: int,
            bytes_string: chararray, referrer: chararray, browser: chararray
          );
  • 14. logs = -- convert from string values to typed values such as date_time and integers
          FOREACH logs_base
          GENERATE *,
            DATE_TIME(time, 'dd/MMM/yyyy:HH:mm:ss Z', 'UTC') as datetime,
            (int)REPLACE(bytes_string, '-', '0') as bytes;
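The REPLACE-then-cast step handles Apache's convention of logging "-" when no bytes were sent. The same conversion, sketched in Python:

```python
def to_bytes(bytes_string):
    # Apache logs '-' when no bytes were sent; treat that as 0,
    # mirroring (int)REPLACE(bytes_string, '-', '0') in the Pig script.
    return int(bytes_string.replace('-', '0'))
```

Without this step, the cast to int would fail on the "-" rows.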
  • 15. --
        -- determine total number of requests and bytes served by UTC hour of day,
        -- aggregating as a typical day across the total time of the logs
        --
        by_hour_count = -- group logs by their hour of day, counting the number of logs in that hour
                        -- and the sum of the bytes of rows for that hour
          FOREACH (GROUP logs BY FORMAT_DT('HH', datetime))
          GENERATE $0, COUNT($1) AS num_requests, SUM($1.bytes) AS num_bytes;
        STORE by_hour_count INTO '$OUTPUT/total_requests_bytes_per_hour';
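The GROUP ... FOREACH ... COUNT/SUM combination is an ordinary group-and-aggregate. A minimal Python sketch, with made-up (hour, bytes) pairs standing in for the parsed logs:

```python
from collections import defaultdict

# Made-up (hour-of-day, bytes) pairs standing in for the parsed logs.
records = [('00', 512), ('00', 1024), ('13', 2048)]

# GROUP logs BY hour, then COUNT($1) and SUM($1.bytes) per group.
num_requests = defaultdict(int)
num_bytes = defaultdict(int)
for hour, nbytes in records:
    num_requests[hour] += 1
    num_bytes[hour] += nbytes
```

Pig performs the same aggregation, but distributes the grouping across the cluster as a Map-Reduce shuffle.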
  • 16. --
        -- top 50 X.X.X.* blocks
        --
        by_ip_count = -- group weblog entries by the ip address from the remote address field
                      -- and count the number of entries for each address as well as
                      -- the sum of the bytes
          FOREACH (GROUP logs BY FORMAT('%s.*', EXTRACT(remoteAddr, '(\\d+\\.\\d+\\.\\d+)')))
          GENERATE $0, COUNT($1) AS num_requests, SUM($1.bytes) AS num_bytes;
        by_ip_count_sorted = -- order ips by the number of requests they make
          LIMIT (ORDER by_ip_count BY num_requests DESC) 50;
        STORE by_ip_count_sorted INTO '$OUTPUT/top_50_ips';
  • 17. --
        -- top 50 external referrers
        --
        by_referrer_count = -- group by the referrer URL and count the number of requests
          FOREACH (GROUP logs BY EXTRACT(referrer, '(http://[a-z0-9.-]+)'))
          GENERATE FLATTEN($0), COUNT($1) AS num_requests;
        by_referrer_count_filtered = -- exclude matches for example.org
          FILTER by_referrer_count BY NOT $0 matches '.*example.org';
        by_referrer_count_sorted = -- take the top 50 results
          LIMIT (ORDER by_referrer_count_filtered BY num_requests DESC) 50;
        STORE by_referrer_count_sorted INTO '$OUTPUT/top_50_external_referrers';
  • 18. --
        -- top search terms coming from bing or google
        --
        google_and_bing_urls = -- find referrer fields that match either bing or google
          FILTER (FOREACH logs GENERATE referrer)
          BY referrer matches '.*bing.*' OR referrer matches '.*google.*';
        search_terms = -- extract from each referrer url the search phrases
          FOREACH google_and_bing_urls
          GENERATE FLATTEN(EXTRACT(referrer, '.*[&?]q=([^&]+).*')) as (term:chararray);
        search_terms_filtered = -- reject urls that contained no search terms
          FILTER search_terms BY NOT $0 IS NULL;
        search_terms_count = -- for each search phrase count the number of weblog entries that contained it
          FOREACH (GROUP search_terms_filtered BY $0)
          GENERATE $0, COUNT($1) AS num;
        search_terms_count_sorted = -- take the top 50 results
          LIMIT (ORDER search_terms_count BY num DESC) 50;
        STORE search_terms_count_sorted INTO '$OUTPUT/top_50_search_terms_from_bing_google';
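The filter-then-extract pipeline for search terms can be mimicked in a few lines of Python; the referrer URLs below are invented for illustration:

```python
import re

# Invented referrer URLs for illustration.
referrers = [
    'http://www.google.com/search?q=pig+latin&hl=en',
    'http://www.bing.com/search?q=hadoop',
    'http://www.example.org/page.html',   # neither bing nor google
]

terms = []
for url in referrers:
    # FILTER ... BY referrer matches '.*bing.*' OR '.*google.*'
    if re.search(r'bing|google', url):
        # EXTRACT(referrer, '.*[&?]q=([^&]+).*')
        m = re.match(r'.*[&?]q=([^&]+).*', url)
        if m:  # reject urls that contained no search terms
            terms.append(m.group(1))
```

The grouping, counting, and top-50 steps then proceed exactly as in the previous slide's aggregation.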
  • 19. (GROUP logs BY EXTRACT(referrer, '(http://[a-z0-9.-]+)'))
        (GROUP logs BY FORMAT('%s.*', EXTRACT(remoteAddr, '(\\d+\\.\\d+\\.\\d+)')))
        FLATTEN(EXTRACT(referrer, '.*[&?]q=([^&]+).*')) as (term:chararray)
        Learning regular expressions will help you with scripting.