Lexical analysis involves breaking input text into tokens. It is implemented using regular expressions to specify token patterns and a finite automaton to recognize tokens in the input stream. Lex is a tool that allows specifying a lexical analyzer by defining regular expressions for tokens and actions to perform on each token. It generates code to simulate the finite automaton for token recognition. The generated lexical analyzer converts the input stream into tokens by matching input characters to patterns defined in the Lex source program.
5. Tokens -> RE
• Specification of Tokens: The Patterns corresponding
to a token are generally specified using a compact
notation known as regular expression.
• Regular expressions of a language are created by
combining members of its alphabet.
• A regular expression r corresponds to a set of strings
L(r) where L(r) is called a regular set or a regular
language and may be infinite.
6. RegEx
A regular expression is defined as follows:-
• A basic regular expression a denotes the set {a}
where a ∈Σ; L(a) ={a}
• The regular expression ε denotes the set {ε}
• If r and s are two regular expressions denoting the
sets L(r) and L(s) then; following are some rules for
regular expressions
9. Token Recognition
• Finite Automata are recognizers that can identify the
tokens occurring in input stream.
• Finite Automaton (FA) consists of:
– A finite set of states
– A set of transitions (or moves) between states:
– A special start state
– A set of final or accepting states
14. Input Buffering
• Buffer Pairs: Due to amount of time taken to process
characters and the large number of characters that
must be processed.
• Specialized buffering techniques are developed to
reduce the amount of overhead required to process a
single input character.
15. Language to Specify LA
• Lex, allows one to specify a lexical analyzer by
specifying regular expressions to describe patterns for
tokens.
• The input notation for the Lex tool is referred to as the
Lex language and the tool itself is the Lex compiler.
• Behind the scenes, the Lex compiler transforms the
input patterns into a transition diagram and
generates code, in a file called lex.yy.c, that simulates
this transition diagram.
17. Language to Specify LA
In simple words:-
• LEX converts Lex source program to Lexical
analyzer.
• Lexical Analyzer converts input stream into tokens.
18. Lex Source Program - Structure
A Lex program has the following form:-
• Auxiliary Definitions – it denotes Res of the form
D1 = R1 //Di is the shortcut name for RE
D2 = R2 //Ri is the RE
• Translation Rules – Rules that tell LA which action
to take upon encountering these tokens.
19. Lex Source Program - Structure
Auxiliary Definitions:-
• D1 (letter) = A | B | C…| Z | a | b…| z (R1)
• D2 (digit) = 0 | 1| 2….. | 9 (R2)
• D3 (identifier) = letter (letter | digit)*
• D4 (integer) = digit+
• D5 (sign) = + | -
• D6 (signed-integer) = sign integer
20. Lex Source Program - Structure
Translation Rules:-
• The translation rules each have form: Pi {Actioni}
• Each pattern is a regular expression, which may use
the regular definitions of the declaration section.
• The actions are fragments of code , typically written
in C.
• Ex: for ‘keyword’-> begin {return 1}
• Ex: for ‘variable’-> identifier {install(); return 6}
24. Implementation of LA
• Lex generates LA as its o/p by taking Lex program
as i/p.
• Lex program is collection of patterns (REs) and their
corresponding actions.
• Patterns represent tokens to be recognized by LA to
be generated.
• For each pattern, corresponding NFA will be
designed.
25. Implementation of LA
• There can be ‘n’ no. of NFAs for ‘n’ no. of patterns.
• A start state is taken & using ε-transitions, all NFAs
are combined.
• The final state of each NFA show that it has found
its own token Pi.
• Convert NFA into DFA.
• The final state of DFA shows the token we have
found.
26. Implementation of LA
• If none of the states of DFA include any final states
of NFA, then an error is reported.
• If final state of DFA includes more than one final
state of NFA, then final state for pattern coming first
in transition rule has priority.
************ -----------------************