There are two reasons regular expressions are so hard to read and are so error prone. One, the syntax is terse. Two, programmers ignore all normal programming practices. This talk reintroduces white space, structure, and basic verification/testing and then calls them "Best Practices."
Regex Best Practices
1. Regular Expression Best Practices
Tony Stubblebine
tony@tonystubblebine.com
www.stubbleblog.com
@tonystubblebine
2. Tabbed indentation is a sin but this isn't?
$string =~ s<
(?:http://(?:(?:(?:(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?)\.
)*(?:[a-zA-Z](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(?:\d+)(?:\.(?:\d+)
){3}))(?::(?:\d+))?)(?:/(?:(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F
\d]{2}))|[;:@&=])*)(?:/(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{
2}))|[;:@&=])*))*)(?:\?(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{
2}))|[;:@&=])*))?)?)|(?:ftp://(?:(?:(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?
:%[a-fA-F\d]{2}))|[;?&=])*)(?::(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-
fA-F\d]{2}))|[;?&=])*))?@)?(?:(?:(?:(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-
)*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(?
:\d+)(?:\.(?:\d+)){3}))(?::(?:\d+))?))(?:/(?:(?:(?:(?:[a-zA-Z\d$\-_.+!
*'(),]|(?:%[a-fA-F\d]{2}))|[?:@&=])*)(?:/(?:(?:(?:[a-zA-Z\d$\-_.+!*'()
,]|(?:%[a-fA-F\d]{2}))|[?:@&=])*))*)(?:;type=[AIDaid])?)?)|(?:news:(?:
...................
Abigail, comp.lang.perl.misc,
http://aspn.activestate.com/ASPN/Cookbook/Rx/Recipe/59864
3. Best Practices for Any Programming
There are programming fundamentals that are
routinely ignored by regular expression writers.
Put a line break after statements and space
between expressions.
Throw in a comment or two.
Use subroutines and modules to show structure
and avoid duplication.
Test.
4. Good Code
# Given a URL/URI, fetches it.
# Returns an HTTP::Response object.
sub get {
    my $self = shift; my $uri = shift;
    $uri = $self->base
        ? URI->new_abs( $uri, $self->base )
        : URI->new( $uri );
    return $self->SUPER::get( $uri->as_string, @_ );
}
5. What if we didn't include
documentation or whitespace?
sub get{my$self=shift;my$uri=shift;
$uri=$self->base?URI->new_abs($uri,
$self->base):URI-
>new($uri);return$self-
>SUPER::get($uri->as_string,@_);}
6. What if we were also as terse as
possible?
So:
No documentation
No whitespace
One character variable and method names
7. We'd have a regular expression.
sub g{my($s,$u)=@_;$u=$s->b?U-> n($u,
$s->b):U->q($u);return$s-
>SUPER::g($u->a,@_);}
8. What do we want from best
practices?
Practices that maximize desired goals in certain
applications.
Goals of regex best practices:
Maintainability
Correctness
Development Speed
9. #1: Use Extended Whitespace
Add indentation, newlines, and comments to regular
expressions
Usage /x: m/regex/x
# Look for green or red foxes
$text =~ /(green | red)
    \s
    fox (es)?
    # Allow more than one
/x;
11. Before
What does this match?
$text =~ m/^([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])$/;
12. After
$text =~ m/
    # Match IP addresses like 169.146.10.45
    ^                                 # Start of string
    ([01]?\d\d?|2[0-4]\d|25[0-5])     # Number, 0-255
    \.([01]?\d\d?|2[0-4]\d|25[0-5])   # 0-255
    \.([01]?\d\d?|2[0-4]\d|25[0-5])   # 0-255
    \.([01]?\d\d?|2[0-4]\d|25[0-5])   # 0-255
$/x;
13. #2 Test
You don't know your data.
And you have a typo in your regex.
Guaranteed surprises on both fronts.
14. Fun Gotcha
What file does this code open?
$file = "/etc/passwd\0/var/www/index.html";
if ( $file =~ m/^ .* \.html/x ) {
    open (FILE, "$file");
}
15. Typical Gotcha
This matches foo.gif
But also... foojpg and jpg.doc
# match image files
m/ \. gif | jpg | jpeg | png $/x
16. Test framework
Write your regular expressions in a place where
you can test them.
Build up a list of positive and negative matches
Include list in your documentation, ex:
# matches 800-555-1212 but not
# 800.555.1212 or 800-BETS-OFF
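A minimal sketch of such a list in practice. The pattern `$phone_re` is a hypothetical one written for illustration (the slide names the test strings, not the regex), and Test::More is assumed as the test module:

```perl
use strict;
use warnings;
use Test::More;

# Hypothetical pattern for illustration: NNN-NNN-NNNN, digits and dashes only.
my $phone_re = qr/
    ^ \d{3}     # area code
    - \d{3}     # exchange
    - \d{4} $   # line number
/x;

# The lists from the documentation comment double as test data.
my @match = ('800-555-1212');
my @fail  = ('800.555.1212', '800-BETS-OFF');

ok( $_ =~ $phone_re, "$_ matches" )        for @match;
ok( $_ !~ $phone_re, "$_ fails to match" ) for @fail;

done_testing();
```

Every new surprise in the data becomes one more entry in `@match` or `@fail`.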
17. Hackers Test Framework
Your “framework” could be this simple:
foreach my $test (@tests) {
    # looks like an image file?
    if ( $test =~ m/ \. gif | jpg | jpeg | png $/x ) {
        print "Matched on $test\n";
    } else {
        print "Failed match on $test\n";
    }
}
18. Real Tests Are Better
use Test::More;
my @match = ("foo.gif", "foo.bar.jpg", "bar_foo.gif.jpg.png");
my @fail = ("gif.foo", "foo.gif.", "foopng", "foo.jpeg.bar");
sub match {
    # Still the ungrouped regex: several @fail cases expose the bug.
    return $_[0] =~ m/ \. gif | jpg | jpeg | png $/x;
}
foreach my $test (@match) {
    ok( match($test), "$test matches");
}
foreach my $test (@fail) {
    ok( !match($test), "$test fails to match");
}
done_testing();
19. #3 Use Structure
... as a slow-witted human being I have a very
small head and I had better learn to live with it
and to respect my limitations and give them full
credit, rather than to try to ignore them, for the
latter vain effort will be punished by failure.
~ Edsger Dijkstra
20. Breaking up an email regex
We can write an email regex that looks like this:
m/$user\@$domain/
Build your regexes from smaller regexes like this:
$user = qr/\w+/;
$domain = qr/\w+\.(\w+\.)*\w\w\w?/i;
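A runnable sketch of the same composition. The `^` and `$` anchors, the `$email` name, and the sample addresses are assumptions added for illustration; the pieces stay as deliberately loose as on the slide:

```perl
use strict;
use warnings;

# Small, separately testable pieces.
my $user   = qr/\w+/;
my $domain = qr/\w+\.(?:\w+\.)*\w\w\w?/i;

# Compose them; \@ stops Perl from interpolating an array.
my $email = qr/^$user\@$domain$/;

for my $candidate ('tony@example.com', 'tony@example', 'no-at-sign.example.com') {
    printf "%-24s %s\n", $candidate, $candidate =~ $email ? 'match' : 'no match';
}
```

Each piece can get its own list of positive and negative test strings before the pieces are combined.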
21. Use Post Processing
It's easier to say a number is <= 255 in code than it is as
a regular expression.
# IP Address check
if ( $ip =~ m/^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/ ) {
    foreach my $num ($1, $2, $3, $4) {
        $failure++ unless $num < 256;
    }
} else {
    $failure++;
}
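The same idea wrapped in a subroutine, as a sketch. The name `is_ipv4` and the sample inputs are hypothetical:

```perl
use strict;
use warnings;

# Hypothetical helper: the regex checks the shape, plain code checks the range.
sub is_ipv4 {
    my ($ip) = @_;
    my @octets = $ip =~ m/\A(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\z/
        or return 0;                 # wrong shape: stop before reading captures
    for my $num (@octets) {
        return 0 if $num > 255;      # range check done in code, not in the regex
    }
    return 1;
}

print "$_: ", ( is_ipv4($_) ? "ok" : "rejected" ), "\n"
    for '169.146.10.45', '256.1.1.1', '1.2.3';
```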
22. #4. Good habits
Regexes are hard to debug, so avoid errors.
Error avoidance habits:
Group alternations with parentheses
Use lazy quantifiers
Don't use regular expressions
23. Group Alternations
Group your alternations. In this regex, the dot and
end of string ($) are not part of your alternation.
m/ \. (gif | jpg | jpeg | png) $/x
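A quick check of the grouped version against the match and fail lists from the "Real Tests Are Better" slide. The subroutine name `is_image` is hypothetical, and a non-capturing group `(?: )` is used here since nothing needs to be captured:

```perl
use strict;
use warnings;

# Grouped: the dot and the end-of-string anchor now apply to every extension.
sub is_image {
    my ($name) = @_;
    return $name =~ m/ \. (?: gif | jpg | jpeg | png ) $/x ? 1 : 0;
}

my @match = ("foo.gif", "foo.bar.jpg", "bar_foo.gif.jpg.png");
my @fail  = ("gif.foo", "foo.gif.", "foopng", "foo.jpeg.bar");

print "$_ => ", is_image($_), "\n" for @match, @fail;
```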
24. Use Lazy Quantifiers
Use lazy quantifiers. It's easier to say when to
stop.
<td>.*?</td>
25. Lazy Quantifiers...
Compare that to
#Matches too much
$text = "<td>foo</td><td>bar</td>";
$text =~ m!<td>.*</td>!;
#Matches too little
$text = "<td>foo <b>bar</b> </td>";
$text =~ m/<td>[^<]*/;
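To see the difference, capture what each quantifier actually grabs. This sketch reuses the two sample strings above; the capture groups and variable names are added for illustration:

```perl
use strict;
use warnings;

my $text = "<td>foo</td><td>bar</td>";

my ($greedy) = $text =~ m!<td>(.*)</td>!;    # runs on to the last </td>
my ($lazy)   = $text =~ m!<td>(.*?)</td>!;   # stops at the first </td>

print "greedy: $greedy\n";   # greedy: foo</td><td>bar
print "lazy:   $lazy\n";     # lazy:   foo

my $nested = "<td>foo <b>bar</b> </td>";

my ($negated)     = $nested =~ m/<td>([^<]*)/;     # stops at the first <
my ($lazy_nested) = $nested =~ m!<td>(.*?)</td>!;  # keeps going to </td>

print "negated class: '$negated'\n";
print "lazy:          '$lazy_nested'\n";
```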
26. Don't use regular expressions
Regular expressions don't deal well with
nesting
$text = "<td> foo
<table><tr><td>bar</td>...";
$text =~ m!<td> .*? </td>!xs;
Use something better: an HTML or XML parsing
library.
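A sketch of how the lazy match goes wrong on nested cells. The input `$html` is a hypothetical, fully closed table written for this example, not the truncated string from the slide:

```perl
use strict;
use warnings;

# Hypothetical input: an outer cell that contains a whole inner table.
my $html = "<td> foo <table><tr><td>bar</td></tr></table> baz </td>";

my ($cell) = $html =~ m!<td> (.*?) </td>!xs;

# The lazy match stops at the inner </td>, so the outer cell is cut short
# and " baz " never makes it into the capture.
print "captured: '$cell'\n";
```

No choice of quantifier fixes this, because the regex has no way to count how many cells are open.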
28. #5. Optimize Last
It's more common for regular expressions to be
broken than to be slow.
Optimize last.
Start with the quantifiers
29. Optimizing Quantifiers
# This is slow because the match backtracks from the end
# of the file
$text = "M1 text i'm looking for M2 thousand more characters to come...";
$text =~ m/M1 (.*) M2/s;
# This is slow because the match looks for </body> at
# (nearly) every position.
$html =~ m!<body> (.*?) </body>!xs;
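One plausible way to act on this, sketched under the assumption that you know roughly where the terminator sits: use a lazy quantifier when it is close to the start, a greedy one when it is close to the end. The padding strings here are made up for the example, and no timings are claimed:

```perl
use strict;
use warnings;

# Terminator expected near the start: a lazy quantifier stops scanning early.
my $text = "M1 text i'm looking for M2 " . ( "x" x 1000 );
my ($near) = $text =~ m/M1 (.*?) M2/s;

# Terminator expected near the end: a greedy quantifier jumps to the end
# and backtracks only a short way.
my $html = "<body>" . ( "y" x 1000 ) . "</body>";
my ($far) = $html =~ m!<body> (.*) </body>!xs;

print "near: $near\n";
print "far length: ", length($far), "\n";
```

Swapping greedy for lazy only preserves the result when the terminator appears once, so keep the tests from earlier in place before optimizing.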
30. Buy The Book!
Available from Amazon for $9.95
http://bit.ly/regexpr
Thank you for reading!
I'm tony@tonystubblebine.com