Out With Regex,
 In With Tokens
     Sean Coates
     php|tek 2009
Who is this Sean guy?

• Web Architect at OmniTI (http://omniti.com/)
• Former Editor-in-Chief of php|architect and former
  organizer of php|tek
• PHP Community, Habari, Phergie
• Other conferences (PHP Quebec earlier this year)
• the Twitter: @coates
• Beer Lover (and brewer)
• (I speak too quickly)
“A token is a
categorized block of
text. It can look like
anything; it just needs
to be a useful part of
the structured text.”
                 -Wikipedia
$a = 5 + 7;

    (10 tokens)

  Variable      $a
  Whitespace
  Assign        =
  Whitespace
  Number        5
  Whitespace
  Add           +
  Whitespace
  Number        7
  Terminator    ;
Grammar Matters

$a = 5 + 7; // $b

  $a       Variable
  // $b    Comment (not a Variable)
PHP Example
T_OPEN_TAG     <?php

T_VARIABLE     $a
T_WHITESPACE
               =
T_WHITESPACE
T_LNUMBER      5
T_WHITESPACE
               +
T_WHITESPACE
T_LNUMBER      7
               ;
T_WHITESPACE
T_COMMENT      // $b
“Lexing”
• A lexer converts a sequence of characters
  into tokens
• “Lexical Analysis”
• Lex, Flex, re2c (lexer generators)
Static vs. Dynamic
          Analysis
• Dynamic: actual execution; practical
  implementations include penetration testing.
• Static: analysis of code, tokens, opcodes,
  etc. to determine if a particular action will
  take place


• (not the only use for Tokens, though)
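A minimal sketch of token-based static analysis: decide whether some source contains an `eval` construct without ever running it. The helper name `uses_eval` and the sample strings are hypothetical, chosen only for illustration.

```php
<?php
// Hypothetical helper: report whether $source contains an eval construct.
// The code is only tokenized, never executed.
function uses_eval(string $source): bool
{
    foreach (token_get_all($source) as $token) {
        // Named tokens arrive as arrays: [id, text, line].
        if (is_array($token) && $token[0] === T_EVAL) {
            return true;
        }
    }
    return false;
}

var_dump(uses_eval('<?php eval($code);'));            // bool(true)
var_dump(uses_eval('<?php $s = "eval(1)"; // eval')); // bool(false)
```

The second call returns false because the word `eval` only appears inside a string and a comment, which the tokenizer reports as string and comment tokens rather than as `T_EVAL`.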
Out with Regex
• Find all variables
• Regex:
  /(\$[a-z_][a-z0-9_]*)/i
• context matters:
  $str = '$a = 5 + 7; // $b';
Regex Fail
<?php
$str = '$a = 5 + 7; // $b';
preg_match_all(
     '/(\$[a-z_][a-z0-9_]*)/i', $str, $m
);
var_dump($m[0]);
Regex Fail
array(2) {
    [0]=> string(2) "$a"
    [1]=> string(2) "$b"
}
Out with Regex
• Find all variables      WRONG!
• Regex:
  /(\$[a-z_][a-z0-9_]*)/i
• context matters:
  $str = '$a = 5 + 7; // $b';
Remember?

$a = 5 + 7; // $b

 Variable
            Not a Variable!
Token Approach
<?php
// look ma, no regex!
$str = '<?php $a = 5 + 7; // $b';
foreach (token_get_all($str) as $t) {
    if (is_array($t) && $t[0] == T_VARIABLE) {
        echo $t[1] . "\n";
    }
}
// outputs: $a
PHP Example (again)
T_OPEN_TAG     <?php

T_VARIABLE     $a
T_WHITESPACE
               =
T_WHITESPACE
T_LNUMBER      5
T_WHITESPACE
               +
T_WHITESPACE
T_LNUMBER      7
               ;
T_WHITESPACE
T_COMMENT      // $b
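To see the two approaches side by side, here is a small sketch that runs the regex and a token-based search over the same input. The helper name `find_variables` is an assumption made for this example.

```php
<?php
// Hypothetical helper: collect the text of every T_VARIABLE token.
function find_variables(string $source): array
{
    $names = [];
    foreach (token_get_all($source) as $token) {
        if (is_array($token) && $token[0] === T_VARIABLE) {
            $names[] = $token[1];
        }
    }
    return $names;
}

$code = '<?php $a = 5 + 7; // $b';

// The regex sees plain text, so it also reports the $b inside the comment.
preg_match_all('/\$[a-z_][a-z0-9_]*/i', $code, $m);
print_r($m[0]);                 // $a and $b

// The tokenizer sees structure: "// $b" is a single T_COMMENT token.
print_r(find_variables($code)); // $a only
```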
Regex can be complicated
            (email validation from MRE)
 [040t]* (?: ( [^x80-xffn015()] * (?: (?:  [^x80-xff] | ( [^x80-xffn015()] * (?:  [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?: (?:
[^(040)<>@,;:quot;.[]000-037x80-xff]+ (?![^(040)<>@,;:quot;.[]000-037x80-xff]) | quot; [^x80-xffn015quot;] * (?:  [^x80-xff] [^x80-xffn015quot;] * )* quot; ) [040t]* (?: ( [^x80-xffn
015()] * (?: (?:  [^x80-xff] | ( [^x80-xffn015()] * (?:  [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?: . [040t]* (?: ( [^x80-xffn015()] *
(?: (?:  [^x80-xff] | ( [^x80-xffn015()] * (?:  [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?: [^(040)<>@,;:quot;.[]000-037x80-xff]+ (?!
[^(040)<>@,;:quot;.[]000-037x80-xff]) | quot; [^x80-xffn015quot;] * (?:  [^x80-xff] [^x80-xffn015quot;] * )* quot; ) [040t]* (?: ( [^x80-xffn015()] * (?: (?:  [^x80-xff] | ( [^x80-xff
n015()] * (?:  [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* )* @ [040t]* (?: ( [^x80-xffn015()] * (?: (?:  [^x80-xff] | ( [^x80-xffn015()]
* (?:  [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?: [^(040)<>@,;:quot;.[]000-037x80-xff]+ (?![^(040)<>@,;:quot;.[]000-037x80-xff]) | [ (?: [^
x80-xffn015[]] |  [^x80-xff] )* ] ) [040t]* (?: ( [^x80-xffn015()] * (?: (?:  [^x80-xff] | ( [^x80-xffn015()] * (?:  [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn
015()] * )* ) [040t]* )* (?: . [040t]* (?: ( [^x80-xffn015()] * (?: (?:  [^x80-xff] | ( [^x80-xffn015()] * (?:  [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )*
) [040t]* )* (?: [^(040)<>@,;:quot;.[]000-037x80-xff]+ (?![^(040)<>@,;:quot;.[]000-037x80-xff]) | [ (?: [^x80-xffn015[]] |  [^x80-xff] )* ] ) [040t]* (?: ( [^x80-xffn
015()] * (?: (?:  [^x80-xff] | ( [^x80-xffn015()] * (?:  [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* )* | (?: [^(040)<>@,;:quot;.[]000-037x80-
xff]+ (?![^(040)<>@,;:quot;.[]000-037x80-xff]) | quot; [^x80-xffn015quot;] * (?:  [^x80-xff] [^x80-xffn015quot;] * )* quot; ) [^()<>@,;:quot;.[]x80-xff000-010012-037] * (?: (?: ( [^x80-
xffn015()] * (?: (?:  [^x80-xff] | ( [^x80-xffn015()] * (?:  [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) | quot; [^x80-xffn015quot;] * (?:  [^x80-xff] [^x80-
xffn015quot;] * )* quot; ) [^()<>@,;:quot;.[]x80-xff000-010012-037] * )* < [040t]* (?: ( [^x80-xffn015()] * (?: (?:  [^x80-xff] | ( [^x80-xffn015()] * (?:  [^x80-xff] [^x80-xff
n015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?: @ [040t]* (?: ( [^x80-xffn015()] * (?: (?:  [^x80-xff] | ( [^x80-xffn015()] * (?:  [^x80-xff] [^x80-xffn015()]
* )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?: [^(040)<>@,;:quot;.[]000-037x80-xff]+ (?![^(040)<>@,;:quot;.[]000-037x80-xff]) | [ (?: [^x80-xffn015[]] |  [^x80-xff] )*
] ) [040t]* (?: ( [^x80-xffn015()] * (?: (?:  [^x80-xff] | ( [^x80-xffn015()] * (?:  [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?: .
[040t]* (?: ( [^x80-xffn015()] * (?: (?:  [^x80-xff] | ( [^x80-xffn015()] * (?:  [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?:
[^(040)<>@,;:quot;.[]000-037x80-xff]+ (?![^(040)<>@,;:quot;.[]000-037x80-xff]) | [ (?: [^x80-xffn015[]] |  [^x80-xff] )* ] ) [040t]* (?: ( [^x80-xffn015()] * (?: (?:  [^
x80-xff] | ( [^x80-xffn015()] * (?:  [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* )* (?: , [040t]* (?: ( [^x80-xffn015()] * (?: (?:  [^x80-xff]
| ( [^x80-xffn015()] * (?:  [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* @ [040t]* (?: ( [^x80-xffn015()] * (?: (?:  [^x80-xff] | ( [^x80-
xffn015()] * (?:  [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?: [^(040)<>@,;:quot;.[]000-037x80-xff]+ (?![^(040)<>@,;:quot;.[]000-037x80-
xff]) | [ (?: [^x80-xffn015[]] |  [^x80-xff] )* ] ) [040t]* (?: ( [^x80-xffn015()] * (?: (?:  [^x80-xff] | ( [^x80-xffn015()] * (?:  [^x80-xff] [^x80-xffn015()] * )* ) )
[^x80-xffn015()] * )* ) [040t]* )* (?: . [040t]* (?: ( [^x80-xffn015()] * (?: (?:  [^x80-xff] | ( [^x80-xffn015()] * (?:  [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-
xffn015()] * )* ) [040t]* )* (?: [^(040)<>@,;:quot;.[]000-037x80-xff]+ (?![^(040)<>@,;:quot;.[]000-037x80-xff]) | [ (?: [^x80-xffn015[]] |  [^x80-xff] )* ] ) [040t]* (?: 
( [^x80-xffn015()] * (?: (?:  [^x80-xff] | ( [^x80-xffn015()] * (?:  [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* )* )* : [040t]* (?: ( [^
x80-xffn015()] * (?: (?:  [^x80-xff] | ( [^x80-xffn015()] * (?:  [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* )? (?: [^(040)<>@,;:quot;.[]000-
037x80-xff]+ (?![^(040)<>@,;:quot;.[]000-037x80-xff]) | quot; [^x80-xffn015quot;] * (?:  [^x80-xff] [^x80-xffn015quot;] * )* quot; ) [040t]* (?: ( [^x80-xffn015()] * (?: (?:  [^x80-
xff] | ( [^x80-xffn015()] * (?:  [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?: . [040t]* (?: ( [^x80-xffn015()] * (?: (?:  [^x80-xff] | ( [^
x80-xffn015()] * (?:  [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?: [^(040)<>@,;:quot;.[]000-037x80-xff]+ (?![^(040)<>@,;:quot;.[]000-
037x80-xff]) | quot; [^x80-xffn015quot;] * (?:  [^x80-xff] [^x80-xffn015quot;] * )* quot; ) [040t]* (?: ( [^x80-xffn015()] * (?: (?:  [^x80-xff] | ( [^x80-xffn015()] * (?:  [^x80-xff]
[^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* )* @ [040t]* (?: ( [^x80-xffn015()] * (?: (?:  [^x80-xff] | ( [^x80-xffn015()] * (?:  [^x80-xff] [^x80-
xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?: [^(040)<>@,;:quot;.[]000-037x80-xff]+ (?![^(040)<>@,;:quot;.[]000-037x80-xff]) | [ (?: [^x80-xffn015[]] |  [^
x80-xff] )* ] ) [040t]* (?: ( [^x80-xffn015()] * (?: (?:  [^x80-xff] | ( [^x80-xffn015()] * (?:  [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )*
(?: . [040t]* (?: ( [^x80-xffn015()] * (?: (?:  [^x80-xff] | ( [^x80-xffn015()] * (?:  [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?:
[^(040)<>@,;:quot;.[]000-037x80-xff]+ (?![^(040)<>@,;:quot;.[]000-037x80-xff]) | [ (?: [^x80-xffn015[]] |  [^x80-xff] )* ] ) [040t]* (?: ( [^x80-xffn015()] * (?: (?:  [^
x80-xff] | ( [^x80-xffn015()] * (?:  [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* )* > )
Difficult validation
       made simpler
• Email validation is haaaard!
• Validate logical units separately:

  sean        @          php.net
  Localpart   Separator  Domain

• Still hard, but validation is restricted to
  different types of data
• BTW, don’t bother (-:
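A rough sketch of what "validate logical units separately" could look like. The helper names and the per-unit checks are assumptions for illustration (deliberately loose, not RFC rules), and the advice not to bother still stands.

```php
<?php
// Hypothetical helper: split an address into its logical units.
// Returns [localpart, domain], or null when there is no usable "@".
function split_email(string $email): ?array
{
    $at = strrpos($email, '@'); // last "@": a quoted localpart may contain one
    if ($at === false || $at === 0 || $at === strlen($email) - 1) {
        return null;
    }
    return [substr($email, 0, $at), substr($email, $at + 1)];
}

// Deliberately loose per-unit checks (assumptions, not the full standard).
function looks_like_email(string $email): bool
{
    $parts = split_email($email);
    if ($parts === null) {
        return false;
    }
    [$local, $domain] = $parts;
    $domainOk = filter_var($domain, FILTER_VALIDATE_DOMAIN, FILTER_FLAG_HOSTNAME) !== false;
    return strlen($local) <= 64 && $domainOk;
}

var_dump(looks_like_email('sean@php.net')); // bool(true)
var_dump(looks_like_email('sean.php.net')); // bool(false)
```

Each unit now gets a check suited to its own kind of data, which is easier to reason about than one expression covering the whole address.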
Regex can be complicated
            (email validation from MRE)
 [the same expression as on the earlier "Regex can be complicated" slide, overlaid with:]

                                       strpos($email, '@') !== false
Dirty Little Secret
• Most tokenizers (lexers) use regular
  expressions to separate tokens
• re2c
• Multiple ways to represent separators,
  whitespace, etc., simplified with regex
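As an illustration of that secret, here is a toy lexer for the `$a = 5 + 7;` example that uses one small regex per token type. The rule names and patterns are hypothetical; this is not PHP's real grammar, and real lexer generators such as re2c compile their rules rather than trying them one by one.

```php
<?php
// Hypothetical token names and patterns for the "$a = 5 + 7;" example.
const RULES = [
    'Whitespace' => '\s+',
    'Variable'   => '\$[a-z_][a-z0-9_]*',
    'Number'     => '\d+',
    'Assign'     => '=',
    'Add'        => '\+',
    'Terminator' => ';',
];

// Try each rule at the current offset and emit [name, text] pairs.
function lex(string $input): array
{
    $tokens = [];
    $offset = 0;
    while ($offset < strlen($input)) {
        foreach (RULES as $name => $pattern) {
            // The A modifier anchors the match at $offset.
            if (preg_match('/' . $pattern . '/Ai', $input, $m, 0, $offset)) {
                $tokens[] = [$name, $m[0]];
                $offset += strlen($m[0]);
                continue 2; // move on to the next token
            }
        }
        throw new RuntimeException("Unexpected character at offset $offset");
    }
    return $tokens;
}

foreach (lex('$a = 5 + 7;') as [$name, $text]) {
    echo str_pad($name, 12), $text, "\n";
}
```

Each regex stays tiny because it only has to describe one kind of token; the surrounding loop supplies the structure.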
Practical Uses
• Compile source code
• Simple, contextual replacement (e.g. BBCode)
• Friendly line breaks
• “Curly” quotes, special punctuation
• Input validation/stripping
• Refactoring
PHP’s Tokenizer
• Similar in other languages
• Available (and useful!) in userspace
• Built in to PHP (always available)
PHP Execution
• Lex
• Parse     Tokeny Goodness
• Compile
• Execute
• Cleanup
Tokenizer in Userspace
• token_get_all()
• token_name()
Tokenizer in Userspace
• token_get_all() returns an array of scalars
  and arrays
• A bit hard to work with
• Needs opening tag (<?php or <? depending
  on config)
Tokenizer in Userspace
          (Example)
  print_r(token_get_all('<?php $a = 5 + 7; // $b'));
Array
(
  [0]  => Array ( [0] => 367  [1] => <?php  [2] => 1 )
  [1]  => Array ( [0] => 309  [1] => $a     [2] => 1 )
  [2]  => Array ( [0] => 370  [1] =>        [2] => 1 )
  [3]  => =
  [4]  => Array ( [0] => 370  [1] =>        [2] => 1 )
  [5]  => Array ( [0] => 305  [1] => 5      [2] => 1 )
  [6]  => Array ( [0] => 370  [1] =>        [2] => 1 )
  [7]  => +
  [8]  => Array ( [0] => 370  [1] =>        [2] => 1 )
  [9]  => Array ( [0] => 305  [1] => 7      [2] => 1 )
  [10] => ;
  [11] => Array ( [0] => 370  [1] =>        [2] => 1 )
  [12] => Array ( [0] => 365  [1] => // $b  [2] => 1 )
)
Tokenizer in Userspace
       (Example)
[0] => Array
   (
     [0] => 367     Token Number
     [1] => <?php   Token Text
     [2] => 1       Line Number
   )

[1] => Array
    (
      [0] => 309    token_name(309)
      [1] => $a      == 'T_VARIABLE'
      [2] => 1
    )
(...)
[3] => =            Scalar (not array)
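One way to make the mixed array/scalar result easier to work with is to normalize it into a single shape. This is only a sketch: the helper name `normalized_tokens` and the `T_CHAR` label for single-character tokens are made up for this example, and the numeric token IDs shown above differ between PHP versions, which is why the sketch relies on `token_name()` instead.

```php
<?php
// Hypothetical helper: give every token the same shape, so callers no
// longer need is_array() checks. Scalar tokens carry no line number, so
// they inherit the line of the most recent array token (an approximation).
function normalized_tokens(string $source): array
{
    $out  = [];
    $line = 1;
    foreach (token_get_all($source) as $token) {
        if (is_array($token)) {
            [$id, $text, $line] = $token;
            $out[] = ['name' => token_name($id), 'text' => $text, 'line' => $line];
        } else {
            // Single-character tokens such as "=", "+" and ";".
            $out[] = ['name' => 'T_CHAR', 'text' => $token, 'line' => $line];
        }
    }
    return $out;
}

foreach (normalized_tokens('<?php $a = 5 + 7; // $b') as $t) {
    printf("%-14s %s\n", $t['name'], trim($t['text']));
}
```

PHP 8 also ships `PhpToken::tokenize()`, which returns objects instead of this mixed array.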
Practical Example:
      Simple Highlighter
<pre>
<?php
$c = array(
    T_VARIABLE => 'red',
    T_LNUMBER => 'blue',
);
foreach (token_get_all(fread(STDIN, 9999999)) as $t) {
    if (!is_array($t)) {
        echo htmlentities($t);
    } elseif (!isset($c[$t[0]])) {
        echo htmlentities($t[1]);
        continue;
    } else {
        echo '<span style="color: ' . $c[$t[0]] . '">'
        . htmlentities($t[1]) . '</span>';
    }
}
?>
</pre>
Highlighter Output
<?php
$a = 5 + 7; // $b
<pre>
&lt;?php
<span style="color: red">$a</span> =
<span style="color: blue">5</span> +
<span style="color: blue">7</span>; // $b
</pre>
Entities
•   Hi... I'm Sean

•   Hi&#8230; I&#8217;m Sean

•   Hi… I’m Sean
Entities
•   Here's some code <code>$foo = 'bar';</code>

•   Here&#8217;s some code <code>$foo = 'bar';</code>

•   Here’s some code <code>$foo = 'bar';</code>
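A sketch of how a token-style approach can get this result: split the input into `<code>` chunks and everything else, then replace apostrophes only outside the code. The helper name `curl_apostrophes` is hypothetical, and the slides do not say how the talk's own implementation works.

```php
<?php
// Hypothetical helper: curl apostrophes in prose but leave <code> alone.
function curl_apostrophes(string $html): string
{
    // Tokenize into alternating text and <code>...</code> chunks; the
    // capture group makes preg_split() keep the code chunks in the result.
    $chunks = preg_split('~(<code>.*?</code>)~s', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    foreach ($chunks as $i => $chunk) {
        if ($i % 2 === 0) { // even index: text outside <code>
            $chunks[$i] = str_replace("'", '&#8217;', $chunk);
        }
    }
    return implode('', $chunks);
}

echo curl_apostrophes("Here's some code <code>\$foo = 'bar';</code>"), "\n";
// Here&#8217;s some code <code>$foo = 'bar';</code>
```

A plain search-and-replace over the whole string would also rewrite the quotes around `'bar'` and break the code sample.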
Tokalizer
• PHP token analysis wrapper
• Object-oriented
• Normalized
• Includes a partial parser (in PHP, so it’s
  slow). Doesn’t work with new 5.3
  constructs... yet.
• http://github.com/scoates/tokalizer
Context-aware tools
• phpgrep
regular grep:
     file.php:123: matched line
php grep:
     file.php:123(foo::bar()): matched line
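A simplified sketch of how a tool could find the function that encloses a matched line. This is not phpgrep's actual code: the helper names are hypothetical, class names are not tracked, and closures are skipped.

```php
<?php
// Hypothetical helper: list [name, firstLine, lastLine] for each named
// function or method found in $source.
function function_ranges(string $source): array
{
    $ranges = [];
    $line = 1;            // tracked by counting newlines in token text
    $depth = 0;           // current brace depth
    $expectName = false;  // just saw T_FUNCTION
    $name = null;         // function whose body has not closed yet
    $start = 0;
    $bodyDepth = null;    // brace depth just outside the current body

    foreach (token_get_all($source) as $token) {
        $id   = is_array($token) ? $token[0] : null;
        $text = is_array($token) ? $token[1] : $token;

        if ($id === T_FUNCTION && $bodyDepth === null) {
            $expectName = true;
            $start = $line;
        } elseif ($expectName && $id === T_STRING) {
            $name = $text;
            $expectName = false;
        } elseif ($expectName && $token === '(') {
            $expectName = false;                 // closure: no name
        } elseif ($token === ';' && $bodyDepth === null) {
            $name = null;                        // declaration without a body
            $expectName = false;
        } elseif ($token === '{' || $id === T_CURLY_OPEN || $id === T_DOLLAR_OPEN_CURLY_BRACES) {
            // "{$" and "${" inside strings are closed by a plain "}" too.
            if ($name !== null && $bodyDepth === null) {
                $bodyDepth = $depth;
            }
            $depth++;
        } elseif ($token === '}') {
            $depth--;
            if ($bodyDepth !== null && $depth === $bodyDepth) {
                $ranges[] = [$name, $start, $line];
                $name = null;
                $bodyDepth = null;
            }
        }
        $line += substr_count($text, "\n");
    }
    return $ranges;
}

// Name of the function containing $line, or null at file level.
function enclosing_function(array $ranges, int $line): ?string
{
    foreach ($ranges as [$name, $first, $last]) {
        if ($line >= $first && $line <= $last) {
            return $name;
        }
    }
    return null;
}

$source = "<?php\nfunction foo() {\n    return 1;\n}\n\$x = 2;\n";
$ranges = function_ranges($source);
var_dump(enclosing_function($ranges, 3)); // string(3) "foo"
var_dump(enclosing_function($ranges, 5)); // NULL
```

A grep-like tool could then print the enclosing name next to each matched line number, in the spirit of the `file.php:123(foo::bar())` output above.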
Context-aware tools
• diff-php
regular diff:
@@ -68,6 +68,7 @@
php diff:
@@ -68,6 +68,7 @@ GeshiHighlighterFormatPlugin::do_highlight()
Token dumps
• text token dump
• definition dump (*cough* currently broken)
• html dump
Habari’s HTML
          Tokenizer
• Filter user input (can strip tags intelligently)
• Allow plugins to inject/replace whole
  blocks of HTML without (developer-facing)
  regex
• Facilitate autop, introspection
HTMLPurifier
• Intelligently filters/escapes potentially
  dangerous data
• Token-based approach
• Really difficult
• Code is slow and memory-intensive, but the
  problem it solves is extremely complicated
Questions? Contact...

• http://seancoates.com/
• sean@php.net
• http://omniti.com/is/sean-coates
• IRC: scoates (Freenode and EFNet)
• @coates on Twitter (if it happens to be up)

 
And now you have two problems. Ruby regular expressions for fun and profit by...
And now you have two problems. Ruby regular expressions for fun and profit by...And now you have two problems. Ruby regular expressions for fun and profit by...
And now you have two problems. Ruby regular expressions for fun and profit by...
 
Unsung Heroes of PHP
Unsung Heroes of PHPUnsung Heroes of PHP
Unsung Heroes of PHP
 
Impacta - Show Day de Rails
Impacta - Show Day de RailsImpacta - Show Day de Rails
Impacta - Show Day de Rails
 
Rack Middleware
Rack MiddlewareRack Middleware
Rack Middleware
 
LAMP_TRAINING_SESSION_6
LAMP_TRAINING_SESSION_6LAMP_TRAINING_SESSION_6
LAMP_TRAINING_SESSION_6
 
recycle
recyclerecycle
recycle
 
JSARToolKit / LiveChromaKey / LivePointers - Next gen of AR
JSARToolKit / LiveChromaKey / LivePointers - Next gen of ARJSARToolKit / LiveChromaKey / LivePointers - Next gen of AR
JSARToolKit / LiveChromaKey / LivePointers - Next gen of AR
 
Get Soaked - An In Depth Look At PHP Streams
Get Soaked - An In Depth Look At PHP StreamsGet Soaked - An In Depth Look At PHP Streams
Get Soaked - An In Depth Look At PHP Streams
 
[Erlang LT] Regexp Perl And Port
[Erlang LT] Regexp Perl And Port[Erlang LT] Regexp Perl And Port
[Erlang LT] Regexp Perl And Port
 
Erlang with Regexp Perl And Port
Erlang with Regexp Perl And PortErlang with Regexp Perl And Port
Erlang with Regexp Perl And Port
 
Exploiting Php With Php
Exploiting Php With PhpExploiting Php With Php
Exploiting Php With Php
 
Ruby 程式語言簡介
Ruby 程式語言簡介Ruby 程式語言簡介
Ruby 程式語言簡介
 
Php 2
Php 2Php 2
Php 2
 
R57php 1231677414471772-2
R57php 1231677414471772-2R57php 1231677414471772-2
R57php 1231677414471772-2
 
Scala + WattzOn, sitting in a tree....
Scala + WattzOn, sitting in a tree....Scala + WattzOn, sitting in a tree....
Scala + WattzOn, sitting in a tree....
 
Cより速いRubyプログラム
Cより速いRubyプログラムCより速いRubyプログラム
Cより速いRubyプログラム
 
Perl Presentation
Perl PresentationPerl Presentation
Perl Presentation
 
Ruby presentasjon på NTNU 22 april 2009
Ruby presentasjon på NTNU 22 april 2009Ruby presentasjon på NTNU 22 april 2009
Ruby presentasjon på NTNU 22 april 2009
 
Ruby presentasjon på NTNU 22 april 2009
Ruby presentasjon på NTNU 22 april 2009Ruby presentasjon på NTNU 22 april 2009
Ruby presentasjon på NTNU 22 april 2009
 

Último

Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 

Último (20)

Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 

Out with Regex, In with Tokens

  • 1. Out With Regex, In With Tokens Sean Coates php|tek 2009
  • 2. Who is this Sean guy? • Web Architect at OmniTI (http://omniti.com/) • Former Editor-in-Chief of php|architect and former organizer of php|tek • PHP Community, Habari, Phergie • Other conferences (PHP Quebec earlier this year) • the Twitter: @coates • Beer Lover (and brewer) • (I speak too quickly)
  • 3. “A token is a categorized block of text. It can look like anything; it just needs to be a useful part of the structured text.” -Wikipedia
  • 4. $a = 5 + 7 ;
  • 5. $a = 5 + 7 ; (10 tokens)
  • 6. $a = 5 + 7 ; Whitespace
  • 7. $a = 5 + 7 ; Whitespace Variable
  • 8. $a = 5 + 7 ; Assign Whitespace Variable
  • 9. $a = 5 + 7 ; Number Assign Whitespace Variable
  • 10. $a = 5 + 7 ; Add Number Assign Whitespace Variable
  • 11. $a = 5 + 7 ; Add Number Assign Number Whitespace Variable
  • 12. $a = 5 + 7 ; Add Number Assign Number Whitespace Terminator Variable
  • 13. Grammar Matters $a = 5 + 7; // $b
  • 14. Grammar Matters $a = 5 + 7; // $b Not a Variable Variable
  • 15. Grammar Matters $a = 5 + 7; // $b Variable Comment
  • 16. PHP Example <?php $a = 5 + 7 ; // $b
  • 17. PHP Example T_OPEN_TAG <?php T_VARIABLE $a T_WHITESPACE = T_WHITESPACE T_LNUMBER 5 T_WHITESPACE + T_WHITESPACE T_LNUMBER 7 ; T_WHITESPACE T_COMMENT // $b
  • 18. “Lexing” • a Lexer converts a sequence of characters into tokens • “Lexical Analysis” • Lex, Flex, re2c (lexer generators)
  • 19. Static vs. Dynamic Analysis • Dynamic: actual execution, practical implementations such as pen. testing. • Static: analysis of code, tokens, opcodes, etc. to determine if a particular action will take place • (not the only use for Tokens, though)
  • 20. Out with Regex • Find all variables • Regex: /(\$[a-z_][a-z0-9_]*)/i
  • 21. Out with Regex • Find all variables • Regex: /(\$[a-z_][a-z0-9_]*)/i • context matters: $str = '$a = 5 + 7; // $b';
  • 22. Regex Fail <?php $str = '$a = 5 + 7; // $b'; preg_match_all( '/(\$[a-z_][a-z0-9_]*)/i', $str, $m ); var_dump($m[0]);
  • 23. Regex Fail array(2) { [0]=> string(2) "$a" [1]=> string(2) "$b" }
  • 24. Out with Regex • Find all variables WRONG! • Regex: /(\$[a-z_][a-z0-9_]*)/i • context matters: $str = '$a = 5 + 7; // $b';
  • 25. Remember? $a = 5 + 7; // $b Variable Not a Variable!
  • 26. Token Approach <?php // look ma, no regex! $str = '<?php $a = 5 + 7; // $b'; foreach (token_get_all($str) as $t) { if (is_array($t) && $t[0] == T_VARIABLE) { echo $t[1] . "\n"; } } // outputs: $a
  • 27. PHP Example (again) T_OPEN_TAG <?php T_VARIABLE $a T_WHITESPACE = T_WHITESPACE T_LNUMBER 5 T_WHITESPACE + T_WHITESPACE T_LNUMBER 7 ; T_WHITESPACE T_COMMENT // $b
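The contrast between the two approaches can be sketched side by side. This is a minimal illustration, assuming PHP 7 or later with the tokenizer extension enabled; `find_variables()` is a hypothetical helper name, not something from the talk, and the second line of sample input is added here to show a single-quoted string as well as a comment.

```php
<?php
// Hypothetical helper (not from the talk): return the text of every
// token that the tokenizer classifies as T_VARIABLE.
function find_variables(string $code): array
{
    $found = [];
    foreach (token_get_all($code) as $token) {
        // Single-character tokens such as "=" or ";" come back as plain strings.
        if (is_array($token) && $token[0] === T_VARIABLE) {
            $found[] = $token[1];
        }
    }
    return $found;
}

// Sample input: $b sits in a comment, $c inside a single-quoted string.
$code = '<?php $a = 5 + 7; // $b' . "\n" . '$s = \'$c\';';

// Regex approach: matches anything that merely looks like a variable.
preg_match_all('/\$[a-z_][a-z0-9_]*/i', $code, $m);
print_r($m[0]);                 // also reports $b and $c

// Token approach: only tokens the lexer itself calls variables.
print_r(find_variables($code)); // reports $a and $s only
```

The tokenizer needs the opening `<?php` tag in the input, as the talk notes later; without it the whole string is treated as inline HTML.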
  • 28. Regex can be complicated (email validation from MRE) [Slide filled edge to edge with a single regular expression for validating email addresses; its backslashes and quotes were lost in extraction, so it is not reproduced here.]
  • 29. Difficult validation made simpler • Email validation is haaaard! • Validate logical units separately: s e a n @ p h p. n e t
  • 30. Difficult validation made simpler • Email validation is haaaard! • Validate logical units separately: s e a n @ p h p. n e t Domain Localpart Separator
  • 31. Difficult validation made simpler • Email validation is haaaard! • Validate logical units separately: s e a n @ p h p. n e t • Still hard, but validation is restricted to different types of data • BTW, don’t bother (-:
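Splitting the address into logical units first might look like the sketch below. It is an illustration only, assuming PHP 7.1 or later; `split_email()` and `looks_like_domain()` are hypothetical names, and the domain check is a deliberately loose hostname test, not a full RFC implementation.

```php
<?php
// Hypothetical helper: split an address into local part and domain
// at the last "@" (the separator token).
function split_email(string $email): ?array
{
    $at = strrpos($email, '@');
    // Reject a missing separator, an empty local part, or an empty domain.
    if ($at === false || $at === 0 || $at === strlen($email) - 1) {
        return null;
    }
    return [
        'local'  => substr($email, 0, $at),
        'domain' => substr($email, $at + 1),
    ];
}

// Hypothetical helper: loose check that each dot-separated label is
// 1-63 characters of letters, digits, or inner hyphens.
function looks_like_domain(string $domain): bool
{
    foreach (explode('.', $domain) as $label) {
        if ($label === '' || strlen($label) > 63) {
            return false;
        }
        if ($label[0] === '-' || substr($label, -1) === '-') {
            return false;
        }
        if (!ctype_alnum(str_replace('-', '', $label))) {
            return false;
        }
    }
    return true;
}

$parts = split_email('sean@php.net');
var_dump($parts !== null && looks_like_domain($parts['domain']));
```

Each unit now gets its own small rule set, so a failure can be reported against the local part or the domain specifically.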
  • 32. Regex can be complicated (email validation from MRE) [The same full-slide regular expression as slide 28, overlaid with:] strpos($email, '@') !== false
  • 33. Dirty Little Secret • Most tokenizers (lexers) use regular expressions to separate tokens • re2c • Multiple ways to represent separators, whitespace, etc., simplified with regex
  • 34. Practical Uses • Compile source code • Simple, contextual replacement (e.g. BBCode) • Friendly line breaks • “Curly” quotes, special punctuation • Input validation/stripping • Refactoring
  • 35. PHP’s Tokenizer • Similar in other languages • Available (and useful!) in userspace • Built in to PHP (always available)
  • 36. PHP Execution • Lex • Parse • Compile • Execute • Cleanup
  • 37. PHP Execution • Lex • Parse Tokeny Goodness • Compile • Execute • Cleanup
  • 38. Tokenizer in Userspace • token_get_all() • token_name()
  • 39. Tokenizer in Userspace • token_get_all() returns an array of scalars and arrays • A bit hard to work with • Needs opening tag (<?php or <? depending on config)
  • 40. Tokenizer in Userspace (Example) print_r(token_get_all('<?php $a = 5 + 7; // $b')); Array ( [0] => Array ( [0] => 367 [1] => <?php [2] => 1 ) [1] => Array ( [0] => 309 [1] => $a [2] => 1 ) [2] => Array ( [0] => 370 [1] => [2] => 1 ) [3] => = [4] => Array ( [0] => 370 [1] => [2] => 1 ) [5] => Array ( [0] => 305 [1] => 5 [2] => 1 ) [6] => Array ( [0] => 370 [1] => [2] => 1 ) [7] => + [8] => Array ( [0] => 370 [1] => [2] => 1 ) [9] => Array ( [0] => 305 [1] => 7 [2] => 1 ) [10] => ; [11] => Array ( [0] => 370 [1] => [2] => 1 ) [12] => Array ( [0] => 365 [1] => // $b [2] => 1 ) )
  • 41. Tokenizer in Userspace (Example) [0] => Array ( [0] => 367 (Token Number) [1] => <?php (Token Text) [2] => 1 (Line Number) ) [1] => Array ( [0] => 309 [1] => $a [2] => 1 ) token_name(309) == 'T_VARIABLE' (...) [3] => = Scalar (not array)
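Because `token_get_all()` mixes arrays and plain strings, a small normalizing wrapper makes the output easier to loop over. The following is a rough sketch, assuming PHP 7.1 or later; `normalize_tokens()` and its `name`/`text`/`line` keys are hypothetical choices made for this example and are not the Tokalizer API.

```php
<?php
// Hypothetical helper: give every token the same shape,
// ['name' => ..., 'text' => ..., 'line' => ...].
function normalize_tokens(string $code): array
{
    $out  = [];
    $line = 1;
    foreach (token_get_all($code) as $token) {
        if (is_array($token)) {
            // Array tokens carry [token number, token text, line number].
            [$id, $text, $line] = $token;
            $name = token_name($id);
        } else {
            // Scalar tokens ("=", "+", ";") have no number or line of
            // their own, so the character doubles as its name.
            $text = $token;
            $name = $token;
        }
        $out[] = ['name' => $name, 'text' => $text, 'line' => $line];
        // Carry the line count forward for any scalar token that follows.
        $line += substr_count($text, "\n");
    }
    return $out;
}

foreach (normalize_tokens('<?php $a = 5 + 7; // $b') as $t) {
    printf("%-14s %s\n", $t['name'], trim($t['text']));
}
```

Using `token_name()` instead of the raw numbers matters because the numeric values (367, 309, and so on) differ between PHP versions, while the `T_*` names are stable.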
  • 42. Practical Example: Simple Highlighter <pre> <?php $c = array( T_VARIABLE => 'red', T_LNUMBER => 'blue', ); foreach (token_get_all(fread(STDIN, 9999999)) as $t) { if (!is_array($t)) { echo htmlentities($t); } elseif (!isset($c[$t[0]])) { echo htmlentities($t[1]); continue; } else { echo '<span style="color: ' . $c[$t[0]] . '">' . htmlentities($t[1]) . '</span>'; } } ?> </pre>
  • 43. Highlighter Output <?php $a = 5 + 7; // $b <pre> &lt;?php <span style="color: red">$a</span> = <span style="color: blue">5</span> + <span style="color: blue">7</span>; // $b </pre>
  • 44. Entities • Hi... I'm Sean
  • 45. Entities • Hi... I'm Sean • Hi&#8230; I&#8217;m Sean • Hi… I’m Sean
  • 46. Entities • Here's some code <code>$foo = 'bar';</code> • Here&#8217;s some code
  • 47. Entities • Here's some code <code>$foo = 'bar';</code> • Here&#8217;s some code <code>$foo = 'bar';</code> • Here’s some code <code>$foo = 'bar';</code>
  • 48. Entities • Here's some code <code>$foo = 'bar';</code> • Here&#8217;s some code <code>$foo = 'bar';</code> • Here’s some code <code>$foo = 'bar';</code>
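The entity slides can be approximated with a two-token split: text and `<code>` blocks. This is a simplified sketch, assuming PHP 7 or later; `curl_outside_code()` is a hypothetical name, every apostrophe is treated as a closing quote, and nested or attribute-bearing `<code>` tags are not handled. In keeping with the "dirty little secret" above, the split itself uses a regular expression.

```php
<?php
// Hypothetical helper: replace straight apostrophes and "..." with
// typographic entities, but leave <code>...</code> blocks untouched.
function curl_outside_code(string $html): string
{
    // Split into alternating text and <code> chunks; the captured
    // delimiter (the code block) is kept in the result array.
    $parts = preg_split(
        '~(<code>.*?</code>)~s',
        $html,
        -1,
        PREG_SPLIT_DELIM_CAPTURE
    );

    $out = '';
    foreach ($parts as $part) {
        if (strncmp($part, '<code>', 6) === 0) {
            $out .= $part; // code token: pass through verbatim
        } else {
            $out .= str_replace(
                ["'", '...'],
                ['&#8217;', '&#8230;'],
                $part
            );
        }
    }
    return $out;
}

echo curl_outside_code("Here's some code <code>\$foo = 'bar';</code>"), "\n";
```

The quotes inside the `<code>` element survive unchanged, which is the behaviour the last Entities slide illustrates.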
  • 49. Tokalizer • PHP token analysis wrapper • Object-oriented • Normalized • Includes a partial parser (in PHP, so it’s slow). Doesn’t work with new 5.3 constructs... yet. • http://github.com/scoates/tokalizer
  • 50. Context-aware tools • phpgrep regular grep: file.php:123: matched line php grep: file.php:123(foo::bar()): matched line
  • 51. Context-aware tools • diff-php regular diff: @@ -68,6 +68,7 @@ php diff: @@ -68,6 +68,7 @@ GeshiHighlighterFormatPlugin::do_highlight()
  • 52. Token dumps • text token dump • definition dump (*cough* currently broken) • html dump
  • 53. Habari’s HTML Tokenizer • Filter user input (can strip tags intelligently) • Allow plugins to inject/replace whole blocks of HTML without (developer-facing) regex • Facilitate autop, introspection
  • 54. HTMLPurifier • Intelligently filters/escapes potentially dangerous data • Token-based approach • Really difficult • Code is slow and memory-intensive, but it’s extremely complicated
  • 55.
  • 56. Questions? Contact... • http://seancoates.com/ • sean@php.net • http://omniti.com/is/sean-coates • IRC: scoates (Freenode and EFNet) • @coates on Twitter (if it happens to be up)