Regular Expressions -- SAS and Perl
1. Regular Expressions: SAS® (RX) vs. Perl (PRX)
Mark Tabladillo Ph.D.
April 10, 2005
© 2005, markTab Consulting, All Rights Reserved
2. Motivation
The SAS System Version 9 introduces Perl
regular expressions (PRX)
Earlier software versions already had SAS
regular expressions (RX)
3. Purpose
This presentation will compare and
contrast the two types of regular
expressions (RX and PRX) from both the
functionality and performance viewpoints
The goal: Offer recommendations on
when to use the two types
Application: Two generic examples will
illustrate the recommended strategy
4. Outline
Background
Similarities between SAS (RX) and Perl
Regular Expressions (PRX)
Unique Perl Regular Expression (PRX)
Capabilities
Recommended Strategy for SAS (RX) and
Perl Regular Expressions (PRX)
Two Examples of Recommended Strategy
5. Outline
Background
Similarities between SAS (RX) and Perl
Regular Expressions (PRX)
Unique Perl Regular Expression (PRX)
Capabilities
Recommended Strategy for SAS (RX) and
Perl Regular Expressions (PRX)
Two Examples of Recommended Strategy
6. Vocabulary
Pattern matching enables you to search for and
extract multiple matching patterns from a character
string in one step, as well as to make several
substitutions in a string in one step
Regular expressions are a pattern language which
provides fast tools for parsing large amounts of text.
Metacharacters are special combinations of
alphanumeric and/or symbolic characters which have
specific meaning in defining a regular expression.
Character classes are single characters or combinations of
alphanumeric and/or symbolic characters which
represent themselves.
7. Is “One Step” Realistic?
Practical uses of regular expressions use
more than one step
Regular expressions provide a powerful
parsimonious syntax for string
manipulation
8. When to Use Regular Expressions
Anything done in regular expressions
could be coded another way
Many people do not use metacharacters in
(for example) Google® searches
High-volume or complex string processing
(such as in a data step) provides excellent
potential
9. Why Regular Expressions Can Be Confusing
Regular expressions are a combination of:
– Alphanumeric and/or symbolic characters
representing themselves (character classes)
– Special combinations of alphanumeric and/or
symbolic characters (metacharacters) representing
zero or more combinations of alphanumeric and/or
symbolic characters
– Specially flagged combinations of alphanumeric
and/or symbolic characters which would normally be
interpreted as metacharacters, but instead represent
themselves (character classes)
10. Outline
Background
Similarities between SAS (RX) and Perl
Regular Expressions (PRX)
Unique Perl Regular Expression (PRX)
Capabilities
Recommended Strategy for SAS (RX) and
Perl Regular Expressions (PRX)
Two Examples of Recommended Strategy
11. Similarity One: Parse Function
PARSE is the core function: it creates a
regular expression in memory using
metacharacters, and assigns this regular
expression to a numeric SAS variable,
called the regular expression ID.
The term ID refers to identification.
SAS assigns every PARSE call a
unique numeric value, and
tracks those values automatically.
12. Similarity One: Parse Function
The programming challenge is to create a
regular expression which generically
describes a character string pattern
Metacharacters for SAS (RX) and Perl
(PRX) regular expressions are usually
different, but either method can be used
to create a similar if not identical result
13. Similarity One: Example
In this first example (SAS Institute, 2003), the
goal is to find a pattern that matches (XXX) XXX-XXXX
or XXX-XXX-XXXX for phone numbers in
the United States.
– The first three digits are the area code, and by
standardized rules, the area code cannot start with a
zero or a one.
– The fourth through sixth digits are the prefix, and
again by standard rules, the prefix also cannot start
with a zero or one.
– The suffix may have any digit, including zero or one,
in any of the four places.
14. Phone Number: Perl (PRX)
paren = "\([2-9]\d\d\) ?[2-9]\d\d-\d\d\d\d";
dash = "[2-9]\d\d-[2-9]\d\d-\d\d\d\d";
regexp = "/(" || paren || ")|(" || dash || ")/";
See the Paper for the full code and
explanation
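The Perl-style phone pattern can be tried outside SAS. The following is a minimal sketch in Python, whose `re` module accepts the same syntax for the metacharacters used here (`\d`, `[2-9]`, `?`, `|`, escaped parentheses). It assumes the slide's pattern writes digits as `\d` and escapes the literal parentheses. The function name `find_phone` is hypothetical, and the Perl `/.../` delimiters are dropped because Python does not use them.

```python
import re

# Building blocks from the slide, written as Python raw strings.
PAREN = r"\([2-9]\d\d\) ?[2-9]\d\d-\d\d\d\d"
DASH = r"[2-9]\d\d-[2-9]\d\d-\d\d\d\d"

# Alternation of the two formats; no /.../ delimiters in Python.
PHONE = re.compile("(" + PAREN + ")|(" + DASH + ")")


def find_phone(text):
    """Return the first phone number found in text, or None."""
    match = PHONE.search(text)
    return match.group(0) if match else None


if __name__ == "__main__":
    for sample in ["Call (919) 555-1234 today", "919-555-1234", "(119) 555-1234"]:
        print(sample, "->", find_phone(sample))
```

An area code or prefix starting with zero or one is rejected by the `[2-9]` class, which is the rule stated on the previous slide.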
15. Phone Number: SAS (RX)
paren = "'('$'2-9'$d$d')'[' ']$'2-9'$d$d'-'$d$d$d$d";
dash = "$'2-9'$d$d'-'$'2-9'$d$d'-'$d$d$d$d";
regexp = paren || "|" || dash;
See the Paper for the full code and
explanation
16. Comparing the Methods
A SAS macro was created to compare the
methods
One iteration did not show a difference, so
the iterations were increased to 500
SAS (RX) wins at 3.69 seconds compared
to Perl (PRX) at 3.80 seconds
Point: If speed is an issue, you may try
the two methods to see which wins
17. Similarity Two: Matching
The matching function uses the regular
expression to determine a specific numeric
position in a string
The return from a match function is a
number representing a character position
18. Similarity Three: Substring
The substring routine allows for inputting
a regular expression and string, and
outputting a position and length
Routines (unlike functions) can have
variable numbers of inputs and outputs,
as in the substring routine
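The idea of a match position and of a position plus length can be sketched in Python as a stand-in for the SAS calls. The helper name `substr_position` is hypothetical. Positions are reported 1-based to follow the SAS convention, and `(0, 0)` for "no match" is an assumption of this sketch.

```python
import re


def substr_position(pattern, text):
    """Return (position, length) of the first match, 1-based.

    Returns (0, 0) when nothing matches (a convention of this sketch).
    """
    match = re.search(pattern, text)
    if match is None:
        return (0, 0)
    # Python offsets are 0-based, so add one for a 1-based position.
    return (match.start() + 1, match.end() - match.start())


if __name__ == "__main__":
    print(substr_position(r"\d+", "abc 123 def"))
    print(substr_position(r"\d+", "no digits"))
```

The first element alone corresponds to what a match function returns; the pair corresponds to the two outputs of the substring routine.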
19. Similarity Four: Change
The change routine allows for inputting a
regular expression, a maximum number of
times to replace, and an old string, and
outputs a new string
Both SAS (RX) and Perl (PRX) allow for
changing a string in place
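A change with a maximum number of replacements can be sketched in Python with `re.sub`. The wrapper name `change` is hypothetical. Treating `-1` as "replace every match" follows the PRXCHANGE convention; the translation to Python's `count` argument is specific to this sketch.

```python
import re


def change(pattern, replacement, text, times=-1):
    """Replace up to `times` matches of pattern; -1 replaces all of them."""
    if times == 0:
        # Nothing to replace; Python's count=0 would mean "all", so return early.
        return text
    count = 0 if times < 0 else times  # count=0 means "no limit" in re.sub
    return re.sub(pattern, replacement, text, count=count)


if __name__ == "__main__":
    print(change(r"cat", "dog", "cat cat cat", 2))
    print(change(r"cat", "dog", "cat cat cat"))
```

Unlike the in-place SAS routines, this sketch returns a new string and leaves the input unchanged.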
20. Similarity Five: Free
The free routine releases the memory
allocation for the regular expression
It is recommended to always include a
FREE routine to prevent problems
21. Outline
Background
Similarities between SAS (RX) and Perl
Regular Expressions (PRX)
Unique Perl Regular Expression (PRX)
Capabilities
Recommended Strategy for SAS (RX) and
Perl Regular Expressions (PRX)
Two Examples of Recommended Strategy
22. Capture Buffers
Perl (PRX) regular expressions can use
capture buffers, defined as part of a
match explicitly specified in the Perl
regular expression
The capture buffers are collectively a
one-dimensional numbered array of results
(starting at one, not zero)
Example: Parts of a phone number
More than one step is required
23. Unique Feature One: PRXPOSN Routine
The PRXPOSN routine finds the start
position and length of a numbered capture
buffer
24. Unique Feature Two: PRXPOSN Function
The PRXPOSN Function uses the positional
capture buffer number to return the actual
string in the capture buffer
This function is probably more useful than
the PRXPOSN routine
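Capture buffers numbered from one can be illustrated with Python capture groups, which use the same parenthesis syntax and the same numbering. This is a sketch only: the pattern splits the parenthesized phone format into three groups, and the name `capture_buffers` is hypothetical. For each buffer it reports the text (as the PRXPOSN function would) and the 1-based start position and length (as the PRXPOSN routine would).

```python
import re

# Three capture groups: area code, prefix, suffix.
PHONE_PARTS = re.compile(r"\(([2-9]\d\d)\) ?([2-9]\d\d)-(\d\d\d\d)")


def capture_buffers(text):
    """Return a list of (buffer number, text, 1-based position, length)."""
    match = PHONE_PARTS.search(text)
    if match is None:
        return []
    buffers = []
    # Groups are numbered from 1; group 0 is the whole match.
    for n in range(1, PHONE_PARTS.groups + 1):
        buffers.append((n, match.group(n), match.start(n) + 1, len(match.group(n))))
    return buffers


if __name__ == "__main__":
    for row in capture_buffers("Call (919) 555-1234"):
        print(row)
```

Two steps are visible here, as the slides note: first the match, then the lookup of each numbered buffer.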
25. Unique Feature Three: PRXPAREN
The PRXPAREN function assumes that the
capture buffer is an ordered hierarchical
array, and will return the highest
non-missing capture buffer number
See the paper for an example
26. Unique Feature Four: PRXNEXT
Similar to PRXMATCH, the PRXNEXT
routine will iteratively search a string for
matches
Not based on the capture buffer
Useful when a string can have multiple,
even overlapping, matches
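An iterative search that resumes from a moving start position can be sketched in Python as follows. The function name `find_all` is hypothetical. The loop restarts after the end of each match, so this particular sketch reports successive non-overlapping matches only.

```python
import re


def find_all(pattern, text):
    """Collect (1-based position, length) for each successive match."""
    regex = re.compile(pattern)
    results = []
    start = 0
    while start <= len(text):
        match = regex.search(text, start)
        if match is None:
            break
        results.append((match.start() + 1, match.end() - match.start()))
        # Resume after this match; step one character past an empty match.
        start = match.end() if match.end() > match.start() else match.end() + 1
    return results


if __name__ == "__main__":
    print(find_all(r"[crb]at", "cat bat rat"))
```

Restarting one character after each match start, rather than after its end, is one way such a loop could be adapted to report overlapping matches as well.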
27. Unique Feature Five: PRXDEBUG
The PRXDEBUG routine writes debugging
messages to the log
Provides insight into how regular
expression functions and routines search
through specific strings
Debugging works best when smaller
pieces are checked first, building toward
the whole regular expression
28. Outline
Background
Similarities between SAS (RX) and Perl
Regular Expressions (PRX)
Unique Perl Regular Expression (PRX)
Capabilities
Recommended Strategy for SAS (RX) and
Perl Regular Expressions (PRX)
Two Examples of Recommended Strategy
29. Recommended Strategy
Use the type which has the desired
functionality
If you don’t know either, start with Perl
regular expressions (PRX)
If you are looking at performance or
speed issues, run tests both ways (RX and
PRX)
30. Outline
Background
Similarities between SAS (RX) and Perl
Regular Expressions (PRX)
Unique Perl Regular Expression (PRX)
Capabilities
Recommended Strategy for SAS (RX) and
Perl Regular Expressions (PRX)
Two Examples of Recommended Strategy
31. Example One: Printer Names
The Universal Naming Convention
describes printers as:
\\computer_name\printer_shared_name
The SYSPRINT option returns or sets the
UNC printer name
32. Example One: Printer Name
Problem: A variety of legal UNC formats:
– \\computer_name\printer_shared_name
– (\\computer_name\printer_shared_name)
– (“\\computer_name\printer_shared_name’)
12 printers * 3 formats = 36 combinations
SAS (RX) could be used with 3 separate
regular expressions
Perl (PRX) capture buffer used
33. Example One: PRX
'/([-w]+|[-w]+)/'
The regular expression will extract the
printer name, without the braces, or
brackets, or quotation marks
See the paper for explanation
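The single-capture-buffer approach can be illustrated in Python. The pattern below is a hypothetical one written for this sketch, not the expression from the slide or the paper: it assumes a UNC name is two backslashes, a name, one backslash, and a second name, where each name consists of word characters or hyphens. The function name and the sample printer names are also made up.

```python
import re

# Hypothetical pattern: \\name\name, captured as buffer 1.
UNC = re.compile(r"(\\\\[-\w]+\\[-\w]+)")


def printer_name(value):
    """Return the bare UNC printer name from any of the wrapped formats."""
    match = UNC.search(value)
    return match.group(1) if match else None


if __name__ == "__main__":
    samples = [
        r"\\host1\laser_2",
        r"(\\host1\laser_2)",
        "('" + r"\\host1\laser_2" + "')",
    ]
    for sample in samples:
        print(sample, "->", printer_name(sample))
```

Because parentheses and quotation marks are outside the character class, one pattern covers all three formats and the capture buffer holds only the name.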
34. Example Two: Windows Subdirectory
Get the subdirectory from the longer
string which starts with the drive name
and ends with a specific filename:
– X:\Sub_Directory_1\Sub_Directory_2\...\Sub_Directory_N\Filename.Extension
As in the previous example, the original
string includes the backslash, which is a
Perl delimiting metacharacter
35. Example Two: Regular Expression
'/([A-Za-z]:[.-w]+)([.-w]+)([.-w]+)/'
The regular expression creates three
capture buffers, with the second capture
buffer containing the string of interest
See the paper for a full explanation
36. Conclusion
With version 9, SAS programmers have
two regular expression choices: SAS (RX)
and Perl (PRX)
The presentation described similarities and
differences, and offered a recommended
strategy
The paper contains three detailed
examples, and an annotated bibliography