2. How do repetition tokens match a test string?
Repetition tokens are greedy.
They consume as many characters as they can, and give characters back only when the rest of the regex would otherwise fail.
Let’s check with a valid HTML snippet: http://rubular.com/r/nVoDVeAafp
How do we solve this greediness?
3. How to fix greediness?
A quick fix is laziness: add a ? after the +.
So <.+?> matches each HTML tag separately. Check http://rubular.com/r/yoEJztaClW
A better alternative is a negated character class: <[^>]+>. It needs far less backtracking and hence returns results faster.
Check http://rubular.com/r/WHjIrJW3v7
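The three variants can be compared side by side. This is a minimal sketch in Python rather than Rubular; the sample HTML string is an assumption made for illustration and is not the one used in the linked demos.

```python
import re

# Hypothetical sample input, chosen only for illustration.
html = "<p>Hello <b>world</b></p>"

# Greedy: .+ runs to the end of the string, then backtracks to the last '>'.
greedy = re.findall(r"<.+>", html)

# Lazy: .+? grows one character at a time and stops at the first '>'.
lazy = re.findall(r"<.+?>", html)

# Negated class: [^>]+ can never run past a '>', so there is almost
# nothing to backtrack over.
negated = re.findall(r"<[^>]+>", html)

print(greedy)   # ['<p>Hello <b>world</b></p>']
print(lazy)     # ['<p>', '<b>', '</b>', '</p>']
print(negated)  # ['<p>', '<b>', '</b>', '</p>']
```

The lazy and negated-class versions return the same tags here; they differ in how much work the engine does to find them.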
4. Possessive Quantifiers
Greedy tokens match as many repeats as possible; lazy tokens match as few as possible. Both then backtrack through the remaining possibilities until the whole regex matches the test string.
Possessive quantifiers, on the other hand, hold on to whatever they matched and discard the backtracking positions. So the regex engine gives up as soon as the rest of the pattern fails to match, without backtracking.
/\D*+g/ /string/
No match. Why?
Because \D*+ consumes all of "string" and, unlike greedy or lazy tokens, possessive quantifiers can’t backtrack. So the alternative in which the repeat token matches "strin" and the literal g matches the last character is never tried.
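The effect can be reproduced in Python. Python's `re` module only accepts the `*+` syntax from version 3.11 on, so this sketch swaps in a different technique: a lookahead that captures the greedy run, followed by a backreference, `(?=(\D*))\1`. Lookaheads are not backtracked into once they succeed, so this behaves like the possessive `\D*+` for this example. Treat it as an emulation under that assumption, not as the possessive operator itself.

```python
import re

text = "string"

# Greedy: \D* first takes all of "string", then gives back the final "g"
# so that the literal g can match.
greedy = re.search(r"\D*g", text)

# Possessive-like emulation: the lookahead captures the longest run of
# non-digits, \1 consumes exactly that run, and nothing can be given back.
# The literal g therefore never finds a character to match.
possessive_like = re.search(r"(?=(\D*))\1g", text)

print(greedy.group(0))   # string
print(possessive_like)   # None
```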
8. Alternation
Alternation has the lowest precedence of all regex operators.
It matches exactly one of several alternative regexes.
/I have a cat but no dog./
/I have three clown fish as pet./
What’s your pet?
cat|dog|fish
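A short Python sketch of the pet example, plus one extra illustration of the low precedence. The string "a cat, a dog" and the two patterns applied to it are assumptions added for illustration.

```python
import re

pet = re.compile(r"cat|dog|fish")

sentences = [
    "I have a cat but no dog.",
    "I have three clown fish as pet.",
]
found = [pet.findall(s) for s in sentences]
print(found)  # [['cat', 'dog'], ['fish']]

# Low precedence: the | splits the WHOLE pattern, so this reads as
# "a cat" OR "dog", not "a " followed by cat-or-dog.
ungrouped = re.findall(r"a cat|dog", "a cat, a dog")

# A group limits the alternation to the part inside the parentheses.
grouped = re.findall(r"a (?:cat|dog)", "a cat, a dog")

print(ungrouped)  # ['a cat', 'dog']
print(grouped)    # ['a cat', 'a dog']
```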
9. Word Boundaries: Zero Length Assertions
There are three different positions that qualify
as word boundaries:
● Before the first character in the string,
if the first character is a word
character.
● After the last character in the string, if
the last character is a word character.
● Between two characters in the string,
where one is a word character and the
other is not a word character.
/\b\w+\b/ /bat=cat/
/\B\w+\B/ /bat=cat/
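Both patterns can be tried on the same test string in Python, where `\b` and `\B` have the same meaning. A minimal sketch:

```python
import re

text = "bat=cat"

# \b\w+\b : a run of word characters with a boundary on both sides,
# i.e. each whole word.
whole_words = re.findall(r"\b\w+\b", text)

# \B\w+\B : a run of word characters with NO boundary on either side,
# so the first and last letter of each word are excluded.
inner = re.findall(r"\B\w+\B", text)

print(whole_words)  # ['bat', 'cat']
print(inner)        # ['a', 'a']
```

The `=` is not a word character, so it creates a boundary after "bat" and before "cat", in addition to the boundaries at the start and end of the string.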
10. Groups/Backreference
Token          Property                                 Regex Example            Test String
(group)        Club characters together as one unit     /work(shop)?/            I work at a computer workshop.
\1             Default numeric reference for a group    /(\w+)=\1/               Is cat=bat or rat=rat?
(?<n>group)    Named groups                             /(?<word>\w+)/           $!, cat eats rat.
\k{n}          Named reference for a group              /(?<a1>\w+)=\k{a1}/      Is cat=bat or rat=rat?
(?:group)      Non-capturing groups                     /work(?:shop)?/          I work at a computer workshop.
(?>group)      Atomic groups                            /a(?>bc|b)c/             abbc, abc
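Most rows of the table can be tried in Python, with one caveat: Python's `re` module spells named groups `(?P<name>...)` and named backreferences `(?P=name)`, so the sketch below uses that spelling instead of the one in the table. Atomic groups are left out because `re` only supports them from Python 3.11 on.

```python
import re

sentence = "I work at a computer workshop."
equation = "Is cat=bat or rat=rat?"

# (group): "shop" becomes one optional unit.
words = [m.group(0) for m in re.finditer(r"work(shop)?", sentence)]

# \1: must repeat exactly what group 1 captured.
numbered = re.search(r"(\w+)=\1", equation)

# Named group: the first run of word characters.
first_word = re.search(r"(?P<word>\w+)", "$!, cat eats rat.")

# Named backreference (Python spelling of the named-reference row).
named = re.search(r"(?P<a1>\w+)=(?P=a1)", equation)

# (?:group): groups the characters but captures nothing.
non_capturing = re.search(r"work(?:shop)?", sentence)

print(words)                     # ['work', 'workshop']
print(numbered.group(0))         # rat=rat
print(first_word.group("word"))  # cat
print(named.group("a1"))         # rat
print(non_capturing.groups())    # ()
```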
12. Alternation/Word Boundary/Groups
What’s your language?
c|c++|java|javascript
/I use java for android development and javascript for everything else./
Challenge 2: Will this regex ever match c++ and javascript? Fix it to be “inclusive”.
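One possible answer to the challenge, sketched in Python. Two assumptions are made here: the `+` signs are meant literally and are therefore escaped as `c\+\+`, and the second test string "I like c++ and c." is invented for illustration. The fix shown (longer alternatives first, plus lookarounds instead of `\b`) is only one of several ways to solve it.

```python
import re

text = ("I use java for android development and javascript "
        "for everything else.")

# Alternatives are tried left to right, so "java" wins before
# "javascript" is ever tried, and "c" wins before "c++".
naive = re.findall(r"c|c\+\+|java|javascript", text)
print(naive)  # ['java', 'java', 'c']

# Possible fix: put the longer alternatives first and require that no
# word character touches the match on either side. Lookarounds are used
# instead of \b because there is no word boundary after the "+" of "c++".
inclusive = re.compile(r"(?<!\w)(?:javascript|java|c\+\+|c)(?!\w)")

fixed = inclusive.findall(text)
other = inclusive.findall("I like c++ and c.")
print(fixed)  # ['java', 'javascript']
print(other)  # ['c++', 'c']
```

The stray 'c' in the naive result comes from the "c" inside "javascript", which the lookbehind in the fixed pattern rules out.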
18. Unicode
Encoding                     Sample character                    Regex    Unicode Regex
Encoded as 2 code points     à = U+0061 (a) U+0300 (`)           ^..$     \P{M}\p{M}*+ or (?>\P{M}\p{M}*)
Encoded as one code point    à = U+00E0                          ^.$      \u00E0
Any unicode character        Punctuation mark, numerals etc.     .|..     \X
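The difference between the two encodings can be checked in Python. Python's built-in `re` module does not support `\p{M}` or `\X`, so this sketch swaps in Unicode normalization from the standard `unicodedata` module to bring both forms to a single code point before matching. That is an alternative technique, not the Unicode regex from the table.

```python
import re
import unicodedata

decomposed = "a\u0300"   # "a" followed by a combining grave accent
composed = "\u00e0"      # the same letter as a single code point

# The dot matches one code point, so the decomposed form needs two dots.
two_dots = re.fullmatch(r"..", decomposed)
one_dot_on_decomposed = re.fullmatch(r".", decomposed)
one_dot_on_composed = re.fullmatch(r".", composed)

# NFC normalization composes base letter + combining mark where possible.
normalized = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))  # 2 1
print(one_dot_on_decomposed)           # None
print(normalized == composed)          # True
```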
20. Mathematics Behind Regex
● Originated in 1956, when mathematician Stephen
Cole Kleene described regular languages using
his mathematical notation called regular sets.
● Came into common use from 1968 for two purposes:
pattern matching in a text editor and lexical
analysis in a compiler.
● Among the first uses: Ken Thompson implemented
the first regex engine in the QED editor and later in
the UNIX editor ed. That led to `grep`. Guess what
grep stands for: g/re/p