SlideShare una empresa de Scribd logo
1 de 33
Descargar para leer sin conexión
Hadley Wickham
Stat405Regular expressions
Tuesday, 9 November 2010
1. Recap
2. Regular expressions
3. Other string processing functions
Tuesday, 9 November 2010
Recap
library(stringr)
load("email.rdata")
str(contents) # second 100 are spam
# Extract headers and bodies
breaks <- str_locate(contents, "nn")
header <- str_sub(contents, end = breaks[, 1])
body <- str_sub(contents, start = breaks[, 2])
Tuesday, 9 November 2010
# To parse into fields
h <- header[1]
lines <- str_split(h, "n")[[1]]
continued <- str_sub(lines, 1, 1) %in% c(" ", "t")
# This is a neat trick!
groups <- cumsum(!continued)
fields <- rep(NA, max(groups))
for (i in seq_along(fields)) {
fields[i] <- str_c(lines[groups == i],
collapse = "n")
}
Tuesday, 9 November 2010
We split the content into header and
body. And split up the header into fields.
Both of these tasks used fixed strings
What if the pattern we need to match is
more complicated?
Recap
Tuesday, 9 November 2010
# What if we could just look for lines that
# don't start with whitespace?
break_pos <- str_locate_all(h, "n[^ t]")[[1]]
break_pos[, 2] <- break_pos[, 1]
field_pos <- invert_match(break_pos)
str_sub(h, field_pos[, 1], field_pos[, 2])
Tuesday, 9 November 2010
Matching challenges
• How could we match a phone number?
• How could we match a date?
• How could we match a time?
• How could we match an amount of
money?
• How could we match an email address?
Tuesday, 9 November 2010
Pattern matching
Each of these types of data have a fairly
regular pattern that we can easily pick out
by eye
Today we are going to learn about regular
expressions, which are a pattern
matching language
Almost every programming language has
regular expressions, often with a slightly
different dialect. Very very useful
Tuesday, 9 November 2010
First challenge
• Matching phone numbers
• How are phone numbers normally
written?
• How can we describe this format?
• How can we extract the phone
numbers?
Tuesday, 9 November 2010
Nothing. It is a generic folder that has Enron Global Markets on
the cover. It is the one that I sent you to match your insert to
when you were designing. With the dots. I am on my way over for a
meeting, I'll bring one.
Juli Salvagio
Manager, Marketing Communications
Public Relations
Enron-3641
1400 Smith Street
Houston, TX 77002
713-345-2908-Phone
713-542-0103-Mobile
713-646-5800-Fax
Tuesday, 9 November 2010
Mark,
Good speaking with you. I'll follow up when I get your email.
Thanks,
Rosanna
Rosanna Migliaccio
Vice President
Robert Walters Associates
(212) 704-9900
Fax: (212) 704-4312
mailto:rosanna@robertwalters.com
http://www.robertwalters.com
Tuesday, 9 November 2010
Write down a description, in words, of the
usual format of phone numbers.
Your turn
Tuesday, 9 November 2010
Phone numbers
• 10 digits, normally grouped 3-3-4
• Separated by space, - or ()
• How can we express that with a computer
program? We’ll use regular expressions
• [0-9]: matches any number between 0
and 9
• [- ()]: matches -, space, ( or )
Tuesday, 9 November 2010
phone <- "[0-9][0-9][0-9][- ()][0-9][0-9][0-9][- ()]
[0-9][0-9][0-9][0-9]"
phone2 <- "[0-9]{3}[- ()][0-9]{3}[- ()][0-9]{3}"
test <- body[10]
cat(test)
str_detect(test, phone2)
str_locate(test, phone2)
str_locate_all(test, phone2)
str_extract(test, phone2)
str_extract_all(test, phone2)
Tuesday, 9 November 2010
Qualifier ≥ ≤
? 0 1
+ 1 Inf
* 0 Inf
{m,n} m n
{,n} 0 n
{m,} m Inf
Tuesday, 9 November 2010
# What do these regular expression match?
mystery1 <- "[0-9]{5}(-[0-9]{4})?"
mystery3 <- "[0-9]{3}-[0-9]{2}-[0-9]{4}"
mystery4 <- "[A-Z0-9._%+-]+@[A-Z0-9.-]+.[A-Z]{2,4}"
mystery5 <- "https?://[a-z]+([a-z0-9-]*[a-z0-9]+)?(.([a-z]+
([a-z0-9-]*[a-z0-9]+)?)+)*"
# Think about them first, then input to
# http://xenon.stanford.edu/~xusch/regexp/analyzer.html
# (select java for language - it's closest to R for regexps)
Tuesday, 9 November 2010
New features
• () group parts of a regular expression
• . matches any character
(. matches a .)
• d matches a digit, s matches a space
• Other characters that need to be
escaped: $, ^
Tuesday, 9 November 2010
Tuesday, 9 November 2010
Regular expression
escapes
Tricky, because we are writing strings, but the
regular expression is the contents of the
string.
"a.b" represents the regular expression a
.b, which matches a.b only
"a.b" is an error
"" represents the regular expression 
which matches 
Tuesday, 9 November 2010
Your turn
Improve our initial regular expression for
matching a phone number.
Hints: use the tools linked from the website
to help. You can use str_extract_all(body,
pattern) to extract matches from the emails
Tuesday, 9 November 2010
phone3 <- "[0-9]{3}[- ()]+[0-9]{3}[- ()]+[0-9]{3}"
str_extract_all(body, phone3)
phone4 <- "[0-9][0-9+() -]{6,}[0-9]"
str_extract_all(body, phone4)
Tuesday, 9 November 2010
Regexp String Matches
. "." Any single character
. "." .
d "d" Any digit
s "s" Any white space
" """ "
 "" 
b "b" Word border
Tuesday, 9 November 2010
Your turn
Create a regular expression to match a
date. Test it against the following cases:
c("10/14/1979", "20/20/1945",
"1/1/1905", "5/5/5")
Create a regular expression to match a
time. Test it against the following cases:
c("12:30 pm", "2:15 AM", "312:23 pm",
"1:00 american", "08:20 am")
Tuesday, 9 November 2010
dates <- c("10/14/1979", "20/20/1945", "1/1/1905", "5/5/5")
str_detect(dates, "[1]?[0-9]/[1-3]?[0-9]/[0-9]{2,4}")
times <- c("12:30 pm", "2:15 AM", "312:23 pm",
"1:00 american", "08:20 am")
str_detect(times, "[01]?[0-9]:[0-5][0-9] ([Aa]|[Pp])[Mm]")
str_detect(times, "b1?[0-9]:[0-5][0-9] ([Aa]|[Pp])[Mm]b")
Tuesday, 9 November 2010
Your turn
Create a regular expression that matches
dates like "12 April 2008". Make sure your
expression works when the date is in a
sentence: "On 12 April 2008, I went to
town."
Extract just the month (hint: use str_split)
Tuesday, 9 November 2010
test <- c("15 April 2005", "28 February 1982",
"113 July 1984")
dates <- str_extract("b[0-9]{2} [a-zA-Z]+ [0-9]{4}",
test)
date <- dates[[1]]
str_split(date, " ")[[1]][2]
Tuesday, 9 November 2010
Regexp String Matches
[abc] "[abc]" a, b, or c
[a-c] "[a-c]" a, b, or c
[ac-] "[ac-"] a, c, or -
[.] "[.]" .
[.] "[.]" . or 
[ae-g.] "[ae-g.] a, e, f, g, or .
Tuesday, 9 November 2010
Regexp String Matches
[^abc] "[^abc]" Not a, b, or c
[^a-c] "[^a-c]" Not a, b, or c
[ac^] "[ac^]" a, c, or ^
Tuesday, 9 November 2010
Regexp String Matches
^a "^a" a at start of string
a$ "a$" a at end of string
^a$ "^a$" complete string =
a
$a "$a" $a
Tuesday, 9 November 2010
Your turn
Write a regular expression to match dollar
amounts. Compare the frequency of
mentioning money in spam and non-
spam emails.
Tuesday, 9 November 2010
money1 <- "$[0-9, ]+[0-9]"
str_extract_all(body, money1)
money2 <- "$[0-9, ]+[0-9](.[0-9]+)?"
str_extract_all(body, money2)
mean(str_detect(body[1:100], money2), na.rm = T)
mean(str_detect(body[101:200], money2), na.rm = T)
Tuesday, 9 November 2010
Function Parameters Result
str_detect string, pattern logical vector
str_locate string, pattern numeric matrix
str_extract string, pattern character vector
str_replace
string, pattern,
replacement
character vector
str_split_fixed string, pattern character matrix
Tuesday, 9 November 2010
Single Multiple
(output usually a list)
str_detect
str_locate str_locate_all
str_extract str_extract_all
str_replace str_replace_all
str_split_fixed str_split
Tuesday, 9 November 2010

Más contenido relacionado

La actualidad más candente

Introduction to R for Data Science :: Session 8 [Intro to Text Mining in R, M...
Introduction to R for Data Science :: Session 8 [Intro to Text Mining in R, M...Introduction to R for Data Science :: Session 8 [Intro to Text Mining in R, M...
Introduction to R for Data Science :: Session 8 [Intro to Text Mining in R, M...Goran S. Milovanovic
 
Simple fuzzy Name Matching in Elasticsearch - Graham Morehead
Simple fuzzy Name Matching in Elasticsearch - Graham MoreheadSimple fuzzy Name Matching in Elasticsearch - Graham Morehead
Simple fuzzy Name Matching in Elasticsearch - Graham MoreheadBasis Technology
 
Simple fuzzy name matching in elasticsearch paris meetup
Simple fuzzy name matching in elasticsearch   paris meetupSimple fuzzy name matching in elasticsearch   paris meetup
Simple fuzzy name matching in elasticsearch paris meetupBasis Technology
 
Introduction to R for Data Science :: Session 3
Introduction to R for Data Science :: Session 3Introduction to R for Data Science :: Session 3
Introduction to R for Data Science :: Session 3Goran S. Milovanovic
 
Computational Social Science, Lecture 09: Data Wrangling
Computational Social Science, Lecture 09: Data WranglingComputational Social Science, Lecture 09: Data Wrangling
Computational Social Science, Lecture 09: Data Wranglingjakehofman
 
Flat Filer Presentation
Flat Filer PresentationFlat Filer Presentation
Flat Filer Presentationalibby45
 
SQL with PostgreSQL - Getting Started
SQL with PostgreSQL - Getting StartedSQL with PostgreSQL - Getting Started
SQL with PostgreSQL - Getting StartedOr Chen
 

La actualidad más candente (7)

Introduction to R for Data Science :: Session 8 [Intro to Text Mining in R, M...
Introduction to R for Data Science :: Session 8 [Intro to Text Mining in R, M...Introduction to R for Data Science :: Session 8 [Intro to Text Mining in R, M...
Introduction to R for Data Science :: Session 8 [Intro to Text Mining in R, M...
 
Simple fuzzy Name Matching in Elasticsearch - Graham Morehead
Simple fuzzy Name Matching in Elasticsearch - Graham MoreheadSimple fuzzy Name Matching in Elasticsearch - Graham Morehead
Simple fuzzy Name Matching in Elasticsearch - Graham Morehead
 
Simple fuzzy name matching in elasticsearch paris meetup
Simple fuzzy name matching in elasticsearch   paris meetupSimple fuzzy name matching in elasticsearch   paris meetup
Simple fuzzy name matching in elasticsearch paris meetup
 
Introduction to R for Data Science :: Session 3
Introduction to R for Data Science :: Session 3Introduction to R for Data Science :: Session 3
Introduction to R for Data Science :: Session 3
 
Computational Social Science, Lecture 09: Data Wrangling
Computational Social Science, Lecture 09: Data WranglingComputational Social Science, Lecture 09: Data Wrangling
Computational Social Science, Lecture 09: Data Wrangling
 
Flat Filer Presentation
Flat Filer PresentationFlat Filer Presentation
Flat Filer Presentation
 
SQL with PostgreSQL - Getting Started
SQL with PostgreSQL - Getting StartedSQL with PostgreSQL - Getting Started
SQL with PostgreSQL - Getting Started
 

Destacado (7)

27 development
27 development27 development
27 development
 
Graphical inference
Graphical inferenceGraphical inference
Graphical inference
 
24 modelling
24 modelling24 modelling
24 modelling
 
27 development
27 development27 development
27 development
 
03 Conditional
03 Conditional03 Conditional
03 Conditional
 
02 Ddply
02 Ddply02 Ddply
02 Ddply
 
Math Studies conditional probability
Math Studies conditional probabilityMath Studies conditional probability
Math Studies conditional probability
 

Similar a 22 spam

Regular expressions
Regular expressionsRegular expressions
Regular expressionsEran Zimbler
 
Programming with R in Big Data Analytics
Programming with R in Big Data AnalyticsProgramming with R in Big Data Analytics
Programming with R in Big Data AnalyticsArchana Gopinath
 
useR! 2012 Talk
useR! 2012 TalkuseR! 2012 Talk
useR! 2012 Talkrtelmore
 
Lec09-CS110 Computational Engineering
Lec09-CS110 Computational EngineeringLec09-CS110 Computational Engineering
Lec09-CS110 Computational EngineeringSri Harsha Pamu
 
Introduction to R for Learning Analytics Researchers
Introduction to R for Learning Analytics ResearchersIntroduction to R for Learning Analytics Researchers
Introduction to R for Learning Analytics ResearchersVitomir Kovanovic
 
Introduction to R for Data Science :: Session 5 [Data Structuring: Strings in R]
Introduction to R for Data Science :: Session 5 [Data Structuring: Strings in R]Introduction to R for Data Science :: Session 5 [Data Structuring: Strings in R]
Introduction to R for Data Science :: Session 5 [Data Structuring: Strings in R]Goran S. Milovanovic
 
INFORMATIVE ESSAYThe purpose of the Informative Essay assignme.docx
INFORMATIVE ESSAYThe purpose of the Informative Essay assignme.docxINFORMATIVE ESSAYThe purpose of the Informative Essay assignme.docx
INFORMATIVE ESSAYThe purpose of the Informative Essay assignme.docxcarliotwaycave
 
Unit I - 1R introduction to R program.pptx
Unit I - 1R introduction to R program.pptxUnit I - 1R introduction to R program.pptx
Unit I - 1R introduction to R program.pptxSreeLaya9
 
Ejercicios de estilo en la programación
Ejercicios de estilo en la programaciónEjercicios de estilo en la programación
Ejercicios de estilo en la programaciónSoftware Guru
 

Similar a 22 spam (20)

08 functions
08 functions08 functions
08 functions
 
06 data
06 data06 data
06 data
 
24 Spam
24 Spam24 Spam
24 Spam
 
22 Spam
22 Spam22 Spam
22 Spam
 
P3 2018 python_regexes
P3 2018 python_regexesP3 2018 python_regexes
P3 2018 python_regexes
 
Regular expressions
Regular expressionsRegular expressions
Regular expressions
 
Programming with R in Big Data Analytics
Programming with R in Big Data AnalyticsProgramming with R in Big Data Analytics
Programming with R in Big Data Analytics
 
useR! 2012 Talk
useR! 2012 TalkuseR! 2012 Talk
useR! 2012 Talk
 
P3 2017 python_regexes
P3 2017 python_regexesP3 2017 python_regexes
P3 2017 python_regexes
 
04 reports
04 reports04 reports
04 reports
 
05 subsetting
05 subsetting05 subsetting
05 subsetting
 
Lec09-CS110 Computational Engineering
Lec09-CS110 Computational EngineeringLec09-CS110 Computational Engineering
Lec09-CS110 Computational Engineering
 
Introduction to R for Learning Analytics Researchers
Introduction to R for Learning Analytics ResearchersIntroduction to R for Learning Analytics Researchers
Introduction to R for Learning Analytics Researchers
 
Ggplot2 v3
Ggplot2 v3Ggplot2 v3
Ggplot2 v3
 
04 Reports
04 Reports04 Reports
04 Reports
 
Introduction to R for Data Science :: Session 5 [Data Structuring: Strings in R]
Introduction to R for Data Science :: Session 5 [Data Structuring: Strings in R]Introduction to R for Data Science :: Session 5 [Data Structuring: Strings in R]
Introduction to R for Data Science :: Session 5 [Data Structuring: Strings in R]
 
INFORMATIVE ESSAYThe purpose of the Informative Essay assignme.docx
INFORMATIVE ESSAYThe purpose of the Informative Essay assignme.docxINFORMATIVE ESSAYThe purpose of the Informative Essay assignme.docx
INFORMATIVE ESSAYThe purpose of the Informative Essay assignme.docx
 
13 case-study
13 case-study13 case-study
13 case-study
 
Unit I - 1R introduction to R program.pptx
Unit I - 1R introduction to R program.pptxUnit I - 1R introduction to R program.pptx
Unit I - 1R introduction to R program.pptx
 
Ejercicios de estilo en la programación
Ejercicios de estilo en la programaciónEjercicios de estilo en la programación
Ejercicios de estilo en la programación
 

Más de Hadley Wickham (18)

20 date-times
20 date-times20 date-times
20 date-times
 
19 tables
19 tables19 tables
19 tables
 
18 cleaning
18 cleaning18 cleaning
18 cleaning
 
17 polishing
17 polishing17 polishing
17 polishing
 
16 critique
16 critique16 critique
16 critique
 
15 time-space
15 time-space15 time-space
15 time-space
 
14 case-study
14 case-study14 case-study
14 case-study
 
12 adv-manip
12 adv-manip12 adv-manip
12 adv-manip
 
11 adv-manip
11 adv-manip11 adv-manip
11 adv-manip
 
11 adv-manip
11 adv-manip11 adv-manip
11 adv-manip
 
10 simulation
10 simulation10 simulation
10 simulation
 
10 simulation
10 simulation10 simulation
10 simulation
 
09 bootstrapping
09 bootstrapping09 bootstrapping
09 bootstrapping
 
07 problem-solving
07 problem-solving07 problem-solving
07 problem-solving
 
03 extensions
03 extensions03 extensions
03 extensions
 
02 large
02 large02 large
02 large
 
01 intro
01 intro01 intro
01 intro
 
25 fin
25 fin25 fin
25 fin
 

Último

Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 

Último (20)

Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 

22 spam

  • 2. 1. Recap 2. Regular expressions 3. Other string processing functions Tuesday, 9 November 2010
  • 3. Recap library(stringr) load("email.rdata") str(contents) # second 100 are spam # Extract headers and bodies breaks <- str_locate(contents, "nn") header <- str_sub(contents, end = breaks[, 1]) body <- str_sub(contents, start = breaks[, 2]) Tuesday, 9 November 2010
  • 4. # To parse into fields h <- header[1] lines <- str_split(h, "n")[[1]] continued <- str_sub(lines, 1, 1) %in% c(" ", "t") # This is a neat trick! groups <- cumsum(!continued) fields <- rep(NA, max(groups)) for (i in seq_along(fields)) { fields[i] <- str_c(lines[groups == i], collapse = "n") } Tuesday, 9 November 2010
  • 5. We split the content into header and body. And split up the header into fields. Both of these tasks used fixed strings What if the pattern we need to match is more complicated? Recap Tuesday, 9 November 2010
  • 6. # What if we could just look for lines that # don't start with whitespace? break_pos <- str_locate_all(h, "n[^ t]")[[1]] break_pos[, 2] <- break_pos[, 1] field_pos <- invert_match(break_pos) str_sub(h, field_pos[, 1], field_pos[, 2]) Tuesday, 9 November 2010
  • 7. Matching challenges • How could we match a phone number? • How could we match a date? • How could we match a time? • How could we match an amount of money? • How could we match an email address? Tuesday, 9 November 2010
  • 8. Pattern matching Each of these types of data have a fairly regular pattern that we can easily pick out by eye Today we are going to learn about regular expressions, which are a pattern matching language Almost every programming language has regular expressions, often with a slightly different dialect. Very very useful Tuesday, 9 November 2010
  • 9. First challenge • Matching phone numbers • How are phone numbers normally written? • How can we describe this format? • How can we extract the phone numbers? Tuesday, 9 November 2010
  • 10. Nothing. It is a generic folder that has Enron Global Markets on the cover. It is the one that I sent you to match your insert to when you were designing. With the dots. I am on my way over for a meeting, I'll bring one. Juli Salvagio Manager, Marketing Communications Public Relations Enron-3641 1400 Smith Street Houston, TX 77002 713-345-2908-Phone 713-542-0103-Mobile 713-646-5800-Fax Tuesday, 9 November 2010
  • 11. Mark, Good speaking with you. I'll follow up when I get your email. Thanks, Rosanna Rosanna Migliaccio Vice President Robert Walters Associates (212) 704-9900 Fax: (212) 704-4312 mailto:rosanna@robertwalters.com http://www.robertwalters.com Tuesday, 9 November 2010
  • 12. Write down a description, in words, of the usual format of phone numbers. Your turn Tuesday, 9 November 2010
  • 13. Phone numbers • 10 digits, normally grouped 3-3-4 • Separated by space, - or () • How can we express that with a computer program? We’ll use regular expressions • [0-9]: matches any number between 0 and 9 • [- ()]: matches -, space, ( or ) Tuesday, 9 November 2010
  • 14. phone <- "[0-9][0-9][0-9][- ()][0-9][0-9][0-9][- ()] [0-9][0-9][0-9][0-9]" phone2 <- "[0-9]{3}[- ()][0-9]{3}[- ()][0-9]{3}" test <- body[10] cat(test) str_detect(test, phone2) str_locate(test, phone2) str_locate_all(test, phone2) str_extract(test, phone2) str_extract_all(test, phone2) Tuesday, 9 November 2010
  • 15. Qualifier ≥ ≤ ? 0 1 + 1 Inf * 0 Inf {m,n} m n {,n} 0 n {m,} m Inf Tuesday, 9 November 2010
  • 16. # What do these regular expression match? mystery1 <- "[0-9]{5}(-[0-9]{4})?" mystery3 <- "[0-9]{3}-[0-9]{2}-[0-9]{4}" mystery4 <- "[A-Z0-9._%+-]+@[A-Z0-9.-]+.[A-Z]{2,4}" mystery5 <- "https?://[a-z]+([a-z0-9-]*[a-z0-9]+)?(.([a-z]+ ([a-z0-9-]*[a-z0-9]+)?)+)*" # Think about them first, then input to # http://xenon.stanford.edu/~xusch/regexp/analyzer.html # (select java for language - it's closest to R for regexps) Tuesday, 9 November 2010
  • 17. New features • () group parts of a regular expression • . matches any character (. matches a .) • d matches a digit, s matches a space • Other characters that need to be escaped: $, ^ Tuesday, 9 November 2010
  • 19. Regular expression escapes Tricky, because we are writing strings, but the regular expression is the contents of the string. "a.b" represents the regular expression a .b, which matches a.b only "a.b" is an error "" represents the regular expression which matches Tuesday, 9 November 2010
  • 20. Your turn Improve our initial regular expression for matching a phone number. Hints: use the tools linked from the website to help. You can use str_extract_all(body, pattern) to extract matches from the emails Tuesday, 9 November 2010
  • 21. phone3 <- "[0-9]{3}[- ()]+[0-9]{3}[- ()]+[0-9]{3}" str_extract_all(body, phone3) phone4 <- "[0-9][0-9+() -]{6,}[0-9]" str_extract_all(body, phone4) Tuesday, 9 November 2010
  • 22. Regexp String Matches . "." Any single character . "." . d "d" Any digit s "s" Any white space " """ " "" b "b" Word border Tuesday, 9 November 2010
  • 23. Your turn Create a regular expression to match a date. Test it against the following cases: c("10/14/1979", "20/20/1945", "1/1/1905", "5/5/5") Create a regular expression to match a time. Test it against the following cases: c("12:30 pm", "2:15 AM", "312:23 pm", "1:00 american", "08:20 am") Tuesday, 9 November 2010
  • 24. dates <- c("10/14/1979", "20/20/1945", "1/1/1905", "5/5/5") str_detect(dates, "[1]?[0-9]/[1-3]?[0-9]/[0-9]{2,4}") times <- c("12:30 pm", "2:15 AM", "312:23 pm", "1:00 american", "08:20 am") str_detect(times, "[01]?[0-9]:[0-5][0-9] ([Aa]|[Pp])[Mm]") str_detect(times, "b1?[0-9]:[0-5][0-9] ([Aa]|[Pp])[Mm]b") Tuesday, 9 November 2010
  • 25. Your turn Create a regular expression that matches dates like "12 April 2008". Make sure your expression works when the date is in a sentence: "On 12 April 2008, I went to town." Extract just the month (hint: use str_split) Tuesday, 9 November 2010
  • 26. test <- c("15 April 2005", "28 February 1982", "113 July 1984") dates <- str_extract("b[0-9]{2} [a-zA-Z]+ [0-9]{4}", test) date <- dates[[1]] str_split(date, " ")[[1]][2] Tuesday, 9 November 2010
  • 27. Regexp String Matches [abc] "[abc]" a, b, or c [a-c] "[a-c]" a, b, or c [ac-] "[ac-"] a, c, or - [.] "[.]" . [.] "[.]" . or [ae-g.] "[ae-g.] a, e, f, g, or . Tuesday, 9 November 2010
  • 28. Regexp String Matches [^abc] "[^abc]" Not a, b, or c [^a-c] "[^a-c]" Not a, b, or c [ac^] "[ac^]" a, c, or ^ Tuesday, 9 November 2010
  • 29. Regexp String Matches ^a "^a" a at start of string a$ "a$" a at end of string ^a$ "^a$" complete string = a $a "$a" $a Tuesday, 9 November 2010
  • 30. Your turn Write a regular expression to match dollar amounts. Compare the frequency of mentioning money in spam and non- spam emails. Tuesday, 9 November 2010
  • 31. money1 <- "$[0-9, ]+[0-9]" str_extract_all(body, money1) money2 <- "$[0-9, ]+[0-9](.[0-9]+)?" str_extract_all(body, money2) mean(str_detect(body[1:100], money2), na.rm = T) mean(str_detect(body[101:200], money2), na.rm = T) Tuesday, 9 November 2010
  • 32. Function Parameters Result str_detect string, pattern logical vector str_locate string, pattern numeric matrix str_extract string, pattern character vector str_replace string, pattern, replacement character vector str_split_fixed string, pattern character matrix Tuesday, 9 November 2010
  • 33. Single Multiple (output usually a list) str_detect str_locate str_locate_all str_extract str_extract_all str_replace str_replace_all str_split_fixed str_split Tuesday, 9 November 2010