6. Some people, when confronted
with a problem, think, "I know,
I'll use regular expressions."
Now they have two problems.
Jamie Zawinski
12 Aug, 1997
http://regex.info/blog/2006-09-15/247
http://www.codinghorror.com/blog/2008/06/regular-expressions-now-you-have-two-problems.html
The point is not so much the evils of regular expressions as the evils of overusing them.
10. Formal Language
Theory
• Alphabet Σ={a, b, c, d, e, …, z, λ} (example)
• Words over Σ: "a", "b", "ab", "aequafdhfad"
• Σ*: The set of all words over Σ
11. Formal Language
over Σ
• A subset L of Σ* (with various properties)
• L can be finite, simply enumerating its
well-formed words, but is often infinite
12. Example
• Language L over Σ = {a,b}
• 'a' is a word
• a word may be obtained by appending 'ab'
to an existing word
• only words thus formed are legal
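The three rules above generate the words a, aab, aabab, and so on. As a rough sketch, the same language can be enumerated in Ruby and checked against the pattern a(ab)*. The helper name `words_of_l` and the constant `L_PATTERN` are made up for this illustration.

```ruby
# Enumerate the first `count` words of L by repeatedly appending 'ab',
# starting from the base word 'a'.
# (words_of_l is a hypothetical helper, not part of the original slides.)
def words_of_l(count)
  words = ['a']
  words << words.last + 'ab' while words.size < count
  words
end

# The same language as a regular expression:
# 'a' followed by any number of 'ab'.
L_PATTERN = /\Aa(ab)*\z/

words_of_l(4).each do |word|
  puts "#{word.inspect} in L? #{L_PATTERN.match?(word)}"
end
```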
20. Regular Languages
• ∅ (empty language) is regular
• For each a ∈ Σ (a belongs to Σ), the
singleton language {a} is a regular language.
• If A and B are regular languages, then A ∪ B
(union), A•B (concatenation), and A*
(Kleene star) are regular languages
• No other languages over Σ are regular.
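Each construction rule has a direct counterpart in regexp syntax: union is `|`, concatenation is juxtaposition, and Kleene star is `*`. The sketch below assumes two singleton languages {a} and {b}; the constant names are illustrative only.

```ruby
# Two singleton languages, {a} and {b}, written as patterns.
LANG_A = /a/
LANG_B = /b/

# Each rule from the definition maps onto one regexp operator.
UNION         = /\A(?:#{LANG_A}|#{LANG_B})\z/  # A ∪ B = {a, b}
CONCATENATION = /\A#{LANG_A}#{LANG_B}\z/       # A•B   = {ab}
KLEENE_STAR   = /\A(?:#{LANG_A})*\z/           # A*    = {"", a, aa, ...}

['', 'a', 'b', 'ab', 'aaa'].each do |word|
  puts format('%-5s union=%-5s concat=%-5s star=%s',
              word.inspect,
              UNION.match?(word),
              CONCATENATION.match?(word),
              KLEENE_STAR.match?(word))
end
```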
23. Regular? Expressions
• It turns out that some expressions are
more powerful and express non-regular
languages
• Language of 'squares': (.*)\1
• aa, aaaa, WikiWiki
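A quick way to see the backreference at work is to anchor the pattern and try a few words. This is only a sketch; the constant name `SQUARE` and the sample words other than those on the slide are assumptions.

```ruby
# \1 must repeat exactly what group 1 captured, so the whole
# string has to be some word written twice in a row.
SQUARE = /\A(.*)\1\z/

%w[aa aaaa WikiWiki a Wiki].each do |word|
  verdict = SQUARE.match?(word) ? 'square' : 'not a square'
  puts "#{word}: #{verdict}"
end
```

A finite automaton cannot remember an arbitrarily long first half, which is why this language is not regular.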
24. How does Regexp
work?
• Build a finite state automaton representing
a given regular expression
• Feed the string to the automaton
and see if the match succeeds
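To make the automaton idea concrete, here is a minimal hand-built deterministic automaton for the pattern a(ab)*. The state names and the `accepts?` helper are invented for this sketch, and Ruby's own engine is a backtracking matcher rather than a table-driven automaton like this one.

```ruby
# Transition table: state => { input character => next state }.
# State names are illustrative.
TRANSITIONS = {
  start:  { 'a' => :accept },  # the leading 'a'
  accept: { 'a' => :need_b },  # first half of another 'ab'
  need_b: { 'b' => :accept }   # second half of 'ab'
}.freeze
ACCEPTING = [:accept].freeze

# Feed the string to the automaton one character at a time.
def accepts?(string)
  state = :start
  string.each_char do |char|
    state = TRANSITIONS.fetch(state, {})[char]
    return false if state.nil? # no transition defined: reject
  end
  ACCEPTING.include?(state)
end

%w[a aab aabab ab aa].each { |w| puts "#{w}: #{accepts?(w)}" }
```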
38. /a$/
zyxwvutsrqponmlkjihgfedcba
^
zyxwvutsrqponmlkjihgfedcba
 ^
zyxwvutsrqponmlkjihgfedcba
  ^
zyxwvutsrqponmlkjihgfedcba
   ^
⋮
zyxwvutsrqponmlkjihgfedcba
                         ^
The regexp engine does not reason, "'a$' can match only at the end of the line, so I should
fast-forward to the end of the line." It tries each position in turn.
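The walk shown above can be imitated with a toy model that counts how many start positions are tried before /a$/ succeeds. This is a simplified illustration under the assumption of a completely unoptimized matcher; `scan_for_a_at_end` is a hypothetical helper, and real engines may skip some of this work.

```ruby
# Toy model of an unoptimized matcher for /a$/ on a single line:
# try each start offset from left to right until the pattern fits.
# Returns [matching offset or nil, number of offsets tried].
def scan_for_a_at_end(line)
  attempts = 0
  (0...line.length).each do |offset|
    attempts += 1
    # /a$/ at this offset: an 'a' that is also the last character.
    at_end = offset == line.length - 1
    return [offset, attempts] if line[offset] == 'a' && at_end
  end
  [nil, attempts]
end

p scan_for_a_at_end('zyxwvutsrqponmlkjihgfedcba') # => [25, 26]
```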
39. ^\s*(.*)\s*$
abc d a dfadg
^
abc d a dfadg
^
abc d a dfadg
^
abc d a dfadg
^
# (.*) is greedy, so it captures 'abc d a dfadg ' (trailing whitespace included)
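The effect is easy to reproduce. In this sketch the input string (with two leading and two trailing spaces) and the constant names are assumptions; the lazy variant `(.*?)` is shown only for contrast.

```ruby
GREEDY = /^\s*(.*)\s*$/   # (.*) takes as much as it can
LAZY   = /^\s*(.*?)\s*$/  # (.*?) takes as little as it can

input = '  abc d a dfadg  '

p input[GREEDY, 1] # => "abc d a dfadg  "
p input[LAZY, 1]   # => "abc d a dfadg"
```

With the greedy group, the trailing `\s*` is left with nothing to match, so the whitespace ends up inside the capture.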
40. a?a?a?…a?aaa…a
def pathological(n = 5)
  Regexp.new('a?' * n + 'a' * n)
end

1.upto(40) do |n|
  print n, ": "
  print Time.now, "\n" if 'a' * n =~ pathological(n)
end
47. What's the problem?
#! /usr/bin/env perl
$a = "abc\ndef";
if ($a =~ /^d/) {
  print "yes\n";
}
if ($a =~ /^d/m) {
  print "yes now\n";
}
# prints 'yes now'
Also note that /m means different things in Perl and Ruby.
48. What's the problem?
#! /usr/bin/env ruby
a = "abc\ndef"
if a =~ /^d/
  p "yes"
end
# prints "yes": in Ruby, ^ matches at the start of every line
http://guides.rubyonrails.org/security.html#regular-expressions
49. Security Implications
class File < ActiveRecord::Base
  validates :name, :format => /^[\w\.\-\+]+$/
end
http://guides.rubyonrails.org/security.html#regular-expressions
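Because ^ and $ match at line boundaries in Ruby, a multi-line value can slip past a validation like this one. The sketch below compares line anchors with the string anchors \A and \z; the payload string and the constant names are made up for the example.

```ruby
# Line anchors (^, $) versus string anchors (\A, \z) for the same check.
LINE_ANCHORED   = /^[\w\.\-\+]+$/
STRING_ANCHORED = /\A[\w\.\-\+]+\z/

# Hypothetical malicious input: a harmless first line, then a payload.
payload = "file.txt\n<script>alert('hello')</script>"

puts "line anchors accept payload:   #{LINE_ANCHORED.match?(payload)}"
puts "string anchors accept payload: #{STRING_ANCHORED.match?(payload)}"
puts "string anchors accept file.txt: #{STRING_ANCHORED.match?('file.txt')}"
```

The line-anchored pattern is satisfied by the first line alone, while the string-anchored pattern requires the whole value to consist of allowed characters.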
57. Prefer Character Class
to Alternations
require 'benchmark'
# simple benchmark for alternations and character class
n = 5_000
str = 'cafebabedeadbeef' * n
Benchmark.bmbm do |x|
  x.report('alternation') do
    str =~ /^(a|b|c|d|e|f)+$/
  end
  x.report('character class') do
    str =~ /^[a-f]+$/
  end
end
58. Benchmarks
Ruby 1.8.7
                      user     system      total        real
alternation       0.030000   0.010000   0.040000 (  0.036702)
character class   0.000000   0.000000   0.000000 (  0.004704)
Ruby 2.0.0
                      user     system      total        real
alternation       0.020000   0.010000   0.030000 (  0.023139)
character class   0.000000   0.000000   0.000000 (  0.009641)
JRuby 1.7.4.dev
                      user     system      total        real
alternation       0.030000   0.000000   0.030000 (  0.021000)
character class   0.010000   0.000000   0.010000 (  0.007000)
59. Beware of Character
Classes
# case-insensitively match any non-word character…
# one is unlike the others
'r' =~ /(?i:[\W])/
's' =~ /(?i:[\W])/ # matches, even though 's' is a word character
't' =~ /(?i:[\W])/
https://bugs.ruby-lang.org/issues/4044