2. Who is this Sean guy?
• Web Architect at OmniTI (http://omniti.com/)
• Former Editor-in-Chief of php|architect and former
organizer of php|tek
• PHP Community, Habari, Phergie
• Other conferences (PHP Quebec earlier this year)
• the Twitter: @coates
• Beer Lover (and brewer)
• (I speak too quickly)
3. “A token is a
categorized block of
text. It can look like
anything; it just needs
to be a useful part of
the structured text.”
-Wikipedia
18. “Lexing”
• a Lexer converts a sequence of characters
into tokens
• “Lexical Analysis”
• Lex, Flex, re2c (lexer generators)
19. Static vs. Dynamic
Analysis
• Dynamic: actual execution, with practical
applications such as penetration testing
• Static: analysis of code, tokens, opcodes,
etc. to determine if a particular action will
take place
• (not the only use for Tokens, though)
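As a rough illustration of the static side, the sketch below looks for eval() calls by inspecting tokens only; the code under inspection is never executed. It relies on PHP's built-in token_get_all() and the T_EVAL token constant; the helper name findEvalLines() and the sample source are made up for this example.

```php
<?php
// Hypothetical helper: return the line numbers on which eval() appears
// in $source, judged purely from the token stream (nothing is executed).
function findEvalLines(string $source): array
{
    $lines = [];
    foreach (token_get_all($source) as $token) {
        // Array tokens have the shape [token id, text, line number].
        if (is_array($token) && $token[0] === T_EVAL) {
            $lines[] = $token[2];
        }
    }
    return $lines;
}

// Sample input: line 2 only mentions eval inside a string literal,
// line 3 is a real eval() call.
$source = <<<'CODE'
<?php
$msg = 'eval("not real code");';
eval($_GET['cmd']);
CODE;

$evalLines = findEvalLines($source);
```

Only the real call on line 3 is reported; the mention inside the string literal on line 2 is a single string token and is ignored.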
21. Out with Regex
• Find all variables
• Regex:
/(\$[a-z_][a-z0-9_]*)/i
• context matters:
$str = '$a = 5 + 7; // $b';
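To make the "context matters" point concrete, here is a small sketch comparing the regex with PHP's tokenizer on that exact line. The regex escapes the dollar sign (\$) so it matches a literal "$"; the variable names $regexVars and $tokenVars are just for this example.

```php
<?php
// The code under inspection: one real variable ($str) and a string literal
// that merely contains text shaped like variables.
$source = <<<'CODE'
<?php
$str = '$a = 5 + 7; // $b';
CODE;

// Regex approach: matches anything shaped like a variable, in any context.
preg_match_all('/(\$[a-z_][a-z0-9_]*)/i', $source, $matches);
$regexVars = $matches[1];

// Token approach: only T_VARIABLE tokens count as variables.
$tokenVars = [];
foreach (token_get_all($source) as $token) {
    if (is_array($token) && $token[0] === T_VARIABLE) {
        $tokenVars[] = $token[1];
    }
}
```

The regex reports $str, $a and $b, while the tokenizer reports only $str, because the whole quoted string is a single T_CONSTANT_ENCAPSED_STRING token.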
30. Difficult validation
made simpler
• Email validation is haaaard!
• Validate logical units separately:
sean @ php.net
Localpart: sean, Separator: @, Domain: php.net
31. Difficult validation
made simpler
• Email validation is haaaard!
• Validate logical units separately:
sean @ php.net
• Still hard, but validation is restricted to
different types of data
• BTW, don’t bother (-:
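For illustration only, the idea of validating logical units separately might look like the sketch below. The helper looksLikeEmail() and both patterns are assumptions made up for this example; they are deliberately simplified and nowhere near RFC-complete, which is rather the point of the "don't bother" advice.

```php
<?php
// Hypothetical helper: split the address at the last "@" and check each
// unit with its own, deliberately simplified, rule.
function looksLikeEmail(string $address): bool
{
    $at = strrpos($address, '@');
    if ($at === false) {
        return false; // no separator at all
    }

    $localpart = substr($address, 0, $at);
    $domain    = (string) substr($address, $at + 1);

    // Simplified localpart rule: letters, digits and a few common symbols.
    $localOk = preg_match('/^[a-z0-9._+-]+$/i', $localpart) === 1;

    // Simplified domain rule: at least two dot-separated labels made of
    // letters, digits and hyphens.
    $domainOk = preg_match('/^[a-z0-9-]+(\.[a-z0-9-]+)+$/i', $domain) === 1;

    return $localOk && $domainOk;
}

$ok = looksLikeEmail('sean@php.net');
```

Each rule only has to understand one kind of data, so each can be tightened or replaced without touching the other.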
33. Dirty Little Secret
• Most tokenizers (lexers) use regular
expressions to separate tokens
• re2c
• Multiple ways to represent separators,
whitespace, etc. are simplified with regex
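A minimal sketch of that secret: a toy lexer that walks the input and tries one regular expression per token type at the current offset. The token names, the pattern table and the lex() function are all invented for this example; real generators such as re2c compile their rules rather than looping like this.

```php
<?php
// Hypothetical token table for a tiny arithmetic language: name => pattern.
const TOKEN_PATTERNS = [
    'T_WHITESPACE' => '\s+',
    'T_NUMBER'     => '\d+',
    'T_VARIABLE'   => '\$[a-z_][a-z0-9_]*',
    'T_OPERATOR'   => '[-+*\/=]',
];

// Convert a string into a list of [token name, matched text] pairs.
function lex(string $input): array
{
    $tokens = [];
    $offset = 0;
    $length = strlen($input);

    while ($offset < $length) {
        foreach (TOKEN_PATTERNS as $name => $pattern) {
            // \G anchors the match at $offset, so nothing is skipped.
            if (preg_match('/\G' . $pattern . '/i', $input, $m, 0, $offset) === 1) {
                $tokens[] = [$name, $m[0]];
                $offset += strlen($m[0]);
                continue 2; // token found, resume the outer loop
            }
        }
        throw new UnexpectedValueException("Unexpected character at offset $offset");
    }

    return $tokens;
}

$tokens = lex('$a = 5 + 7');
```

Whitespace gets its own token type here, so the one \s+ pattern absorbs spaces, tabs and newlines alike.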
39. Tokenizer in Userspace
• token_get_all() returns an array that mixes
strings (single-character tokens) and arrays
(token ID, text, line number)
• A bit hard to work with
• Needs opening tag (<?php or <? depending
on config)
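One way to smooth over the mixed return value is to normalize every token to the same shape, as in the sketch below. token_get_all() and token_name() are PHP built-ins; the helper normalizedTokens() and the 'T_CHAR' label are assumptions invented for this example.

```php
<?php
// Hypothetical helper: normalize token_get_all() output so that every
// token is a [name, text] pair, whatever form PHP returned it in.
function normalizedTokens(string $source): array
{
    $out = [];
    foreach (token_get_all($source) as $token) {
        if (is_array($token)) {
            // Array form: [token id, text, line number]
            $out[] = [token_name($token[0]), $token[1]];
        } else {
            // String form: single-character tokens such as ";" or "=".
            // 'T_CHAR' is a made-up label, not a real PHP token name.
            $out[] = ['T_CHAR', $token];
        }
    }
    return $out;
}

$tokens = normalizedTokens('<?php $a = 5;');

// Without an opening tag, the whole input is treated as inline HTML.
$untagged = normalizedTokens('$a = 5;');
```

The second call shows why the opening tag matters: without it, the tokenizer never enters PHP mode and returns a single T_INLINE_HTML token.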
49. Tokalizer
• PHP token analysis wrapper
• Object-oriented
• Normalized
• Includes a partial parser (written in PHP, so it’s
slow). Doesn’t work with new PHP 5.3
constructs... yet.
• http://github.com/scoates/tokalizer
52. Token dumps
• text token dump
• definition dump (*cough* currently broken)
• html dump
53. Habari’s HTML
Tokenizer
• Filter user input (can strip tags intelligently)
• Allow plugins to inject/replace whole
blocks of HTML without (developer-facing)
regex
• Facilitate autop, introspection
54. HTMLPurifier
• Intelligently filters/escapes potentially
dangerous data
• Token-based approach
• Really difficult
• Code is slow and memory-intensive, but the
problem it solves is extremely complicated