This document discusses syntax analysis in compiler design. It begins by explaining that the lexer takes a string of characters as input and produces a string of tokens as output, which is then input to the parser. The parser takes the string of tokens and produces a parse tree of the program. Context-free grammars are introduced as a natural way to describe the recursive structure of programming languages. Derivations and parse trees are discussed as ways to parse strings based on a grammar. Issues like ambiguity and left recursion in grammars are covered, along with techniques like left factoring that can be used to transform grammars.
1. COMPILER DESIGN
SYNTAX ANALYSIS
RICHA SHARMA (LOVELY PROFESSIONAL UNIVERSITY) 1
Ms. RICHA SHARMA
Assistant Professor
richa.18364@lpu.co.in
Lovely Professional
University
2. SYNTAX ANALYSIS INTRODUCTION
• LEXICAL PHASE IS IMPLEMENTED ON FINITE AUTOMATA & FINITE AUTOMATA CAN REALLY
ONLY EXPRESS THINGS WHERE YOU CAN COUNT MODULUS ON K.
• REGULAR LANGUAGES – THE WEAKEST FORMAL LANGUAGES WIDELY USED
– MANY APPLICATIONS
– CAN’T HANDLE ITERATION & NESTED LOOPS(NESTED IF ELSE ).
TO SUMMARIZE, THE LEXER TAKES A STRING OF CHARACTER AS INPUT AND PRODUCES A STRING
OF TOKENS AS OUTPUT.
THAT STRING OF TOKENS IS THE INPUT TO THE PARSER WHICH TAKES A STRING OF TOKENS AND
PRODUCES A PARSE TREE OF THE PROGRAM.
SOMETIMES THE PARSE TREE IS ONLY IMPLICIT. SO THE, A COMPILER MAY NEVER ACTUALLY
BUILD THE FULL PARSE
TREE.
RICHA SHARMA (LOVELY PROFESSIONAL UNIVERSITY) 2
4. CONTEXT FREE GRAMMARS
expression -> expression + term
expression -> expression – term
expression -> term
term -> term * factor
term -> term / factor
term -> factor
factor -> (expression)
factor -> id
S - IS A FINITE SET OF TERMINALS
N - IS A FINITE SET OF NON-TERMINALS
P - IS A FINITE SUBSET OF PRODUCTION
RULES
S - IS THE START SYMBOL
G=(S ,N,P,S)
• A GRAMMAR DERIVES STRINGS BY BEGINNING WITH START SYMBOL AND
REPEATEDLY REPLACING A NON TERMINAL BY THE RIGHT HAND SIDE OF A
PRODUCTION FOR THAT NON TERMINAL.
• FROM THE START SYMBOL OF A GRAMMAR G FORM THE LANGUAGE L(G)
DEFINED BY THE GRAMMAR THE STRINGS THAT CAN BE DERIVED .
RICHA SHARMA (LOVELY PROFESSIONAL UNIVERSITY) 4
5. RICHA SHARMA (LOVELY PROFESSIONAL UNIVERSITY) 5
• PROGRAMMING LANGUAGES HAVE RECURSIVE STRUCTURE
• CONTEXT-FREE GRAMMARS ARE A NATURAL NOTATION FOR THIS
RECURSIVE STRUCTURE .
NOT ALL STRINGS OF TOKENS ARE PROGRAMS . . .
. . . PARSER MUST DISTINGUISH BETWEEN VALID AND INVALID
STRINGS OF TOKENS
WE NEED :
– A LANGUAGE :FOR DESCRIBING VALID STRINGS OF TOKENS
– A METHOD: FOR DISTINGUISHING VALID FROM INVALID STRINGS OF
TOKENS
CONTEXT FREE GRAMMARS
6. RICHA SHARMA (LOVELY PROFESSIONAL UNIVERSITY)
E ::= T | E + T | E - T
T ::= F | T * F |T / F
F ::= id | (E)
• ARITHMETIC EXPRESSIONS
• STATEMENTS
If Statement ::= if E then Statement else Statement
CONTEXT FREE GRAMMAR EXAMPLES
Steps:
1. Begin with a string with only the start
symbol S
2. Replace any non-terminal X in the
string by the right-hand side of some
production
X -> Y1…Yn
3. Repeat (2) until there are no non-
terminals
6
7. DERIVATIONS
• DERIVATION IS A SEQUENCE OF PRODUCTIONS SO BEGINNING WITH THE START SYMBOL.
• WE CAN APPLY PRODUCTIONS ONE AT A TIME IN SEQUENCE & THAT WILL PRODUCES A
DERIVATION.
• A DERIVATION IS A SEQUENCE OF PRODUCTIONS
A -> … -> … ->… -> … -> …
• A DERIVATION CAN BE DRAWN AS A TREE
– START SYMBOL IS THE TREE’S ROOT
– FOR A PRODUCTION X -> Y1…Yn ADD CHILDREN Y1…Yn TO NODE X
• GRAMMAR
E -> E + E | E * E | (E) | ID
• STRING
ID *ID + ID
RICHA SHARMA (LOVELY PROFESSIONAL UNIVERSITY) 7
8. DERIVATIONS
DERIVATIONS ARE OF TWO TYPES:
• RIGHTMOST AND LEFTMOST DERIVATIONS
• LETS DISCUSS WITH EXAMPLE
GRAMMAR: E -> E + E | E * E | -E | (E) | ID
STRING :(ID+ID)
LEFT MOST DERIVATION RIGHT MOST DERIVATION
E E
= (E) = (E)
= (E+E) = (E+E)
= (ID+E) = (E+ID)
=(ID+ID) =(ID+ID)
RICHA SHARMA (LOVELY PROFESSIONAL UNIVERSITY) 8
9. DERIVATIONS
• NOW WE'RE GOING TO PARSE THIS STRING AND WE'RE GOING TO SHOW HOW TO
PRODUCE A DERIVATION FOR THE STRING AND ALSO AT THE SAME TIME BUILD
THE TREE.
• PARSE TREES HAVE TERMINALS AT THE LEAVES AND NONTERMINALS AT THE
INTERIOR NODES AND FURTHERMORE, IN-ORDER TRAVERSAL OF THE LEAVES IS
THE ORIGINAL INPUT.
• GRAMMAR
E -> E + E | E * E | (E) | ID
• STRING
ID * ID + ID
RICHA SHARMA (LOVELY PROFESSIONAL UNIVERSITY) 9
10. LEFT MOST DERIVATION AND PARSE TREE
E
E
RICHA SHARMA (LOVELY PROFESSIONAL UNIVERSITY) 10
11. LEFT MOST DERIVATION AND PARSE TREE
E
E+E E
E + E
RICHA SHARMA (LOVELY PROFESSIONAL UNIVERSITY) 11
12. LEFT MOST DERIVATION AND PARSE TREE
E
E+E E
E*E+E E + E
E * E
RICHA SHARMA (LOVELY PROFESSIONAL UNIVERSITY) 12
13. LEFT MOST DERIVATION AND PARSE TREE
E
E+E E
E*E+E E + E
id*E+E E * E
id
RICHA SHARMA (LOVELY PROFESSIONAL UNIVERSITY) 13
14. LEFT MOST DERIVATION AND PARSE TREE
E
E+E E
E*E+E E + E
id*E+E E * E
id*id+E id id
RICHA SHARMA (LOVELY PROFESSIONAL UNIVERSITY) 14
15. LEFT MOST DERIVATION AND PARSE TREE
E
E+E E
E*E+E E + E
id*E+E E * E id
id*id+id id id
RICHA SHARMA (LOVELY PROFESSIONAL UNIVERSITY) 15
16. DERIVATIONS
• A PARSE TREE HAS
– TERMINALS AT THE LEAVES
– NON-TERMINALS AT THE INTERIOR NODES
• AN IN-ORDER TRAVERSAL OF THE LEAVES IS THE ORIGINAL INPUT
• THE PARSE TREE SHOWS THE ASSOCIATION OF OPERATIONS, THE INPUT STRING
DOES NOT .
NOTE: THAT RIGHT-MOST AND LEFT-MOST DERIVATIONS HAVE THE SAME PARSE
TREE IF NOT THEN THE GRAMMAR IS AMBIGUOUS GRAMMAR.
RICHA SHARMA (LOVELY PROFESSIONAL UNIVERSITY) 16
17. AMBIGUITY
• IF STRING HAS TWO OR MORE RIGHT MOST DERIVATIONS OR TWO OR MORE
LEFT DERIVATIONS THEN THAT STRING WILL HAVE TWO DISTINCT PARSE TREES
AND HENCE GRAMMAR WILL BE AMBIGUOUS.
• AMBIGUITY IS BAD: LEAVES MEANING OF SOME PROGRAMS ILL-DEFINED
• MULTIPLE PARSE TREES FOR SOME PROGRAM THEN THAT ESSENTIALLY MEANS
THAT YOU'RE LEAVING IT UP TO THE COMPILER TO PICK WHICH OF THOSE TWO
POSSIBLE INTERPRETATIONS OF THE PROGRAM YOU WANT IT TO GENERATE
CODE FOR AND THAT'S NOT A GOOD IDEA.
• TO REMOVE AMBIGUITY WE NEED TO REWRITE THE RULES CHECKING OVER
PRECEDENCE AND ASSOCIATIVITY .
RICHA SHARMA (LOVELY PROFESSIONAL UNIVERSITY) 17
18. AMBIGUITY
RICHA SHARMA (LOVELY PROFESSIONAL UNIVERSITY) 18
Eg: The string id +id* id produces two parse tree hence the grammar is ambiguous.
One can remove the ambiguity by rewriting the grammar as introducing new non-terminal instead of r
non-terminal , but it can result in left or right recursion .Hence we have to remove left recursion.
19. AMBIGUITY
• IF WE HAVE AN AMBIGUOUS GRAMMAR:
E →E * E
E →NUM
• AS THIS DEPENDS ON THE ASSOCIATIVITY OF *,WE USE DIFFERENT REWRITE
RULES FOR DIFFERENT ASSOCIATIVITY .
• IF * IS LEFT-ASSOCIATIVE, WE MAKE THE GRAMMAR LEFT-RECURSIVE BY HAVING
A RECURSIVE REFERENCE TO THE LEFT ONLY OF THE OPERATOR SYMBOL.
UNAMBIGUOUS GRAMMAR: E →E * E’
E →E’
E’→NUM
RICHA SHARMA (LOVELY PROFESSIONAL UNIVERSITY) 19
20. LEFT RECURSION
• UNAMBIGUOUS GRAMMAR : E →E * E’
E →E’
E’→NUM
• THIS GRAMMAR IS NOW LEFT RECURSIVE. LEFT RECURSIVE GRAMMAR IS ANY
GRAMMAR THAT HAS A NON-TERMINAL WHERE IF YOU START WITH THAT NON-
TERMINAL AND YOU DO SOME NON-EMPTY SEQUENCE OF RE-WRITES.
• CONSIDER THE LEFT-RECURSIVE GRAMMAR S -> S a | b
• S GENERATES ALL STRINGS STARTING WITH “a” AND FOLLOWED BY ANY
NUMBER OF “b’S”
• CAN REWRITE USING RIGHT-RECURSION
• S ->bS’
S’ ->aS’ |€
RICHA SHARMA (LOVELY PROFESSIONAL UNIVERSITY)
20
21. EXAMPLES OF LEFT RECURSION
1. E -> E + T | T T -> ID | (E)
2. S ->(L)|X L ->L,S|S
3. S ->S0S1S|01
RICHA SHARMA (LOVELY PROFESSIONAL UNIVERSITY)
21
22. LEFT FACTORING
• LEFT FACTORING IS A GRAMMAR TRANSFORMATION THAT IS USEFUL
FOR PRODUCING A DETERMINISTIC GRAMMAR FROM NON-
DETERMINISTIC GRAMMAR SUITABLE FOR PREDICTIVE OR TOP-DOWN
PARSING.
• CONSIDER FOLLOWING GRAMMAR:
• STMT -> IF EXPR THEN STMT ELSE STMT
• | IF EXPR THEN STMT
• ON SEEING INPUT IF IT IS NOT CLEAR FOR THE PARSER WHICH
PRODUCTION TO USE
• WE CAN EASILY PERFORM LEFT FACTORING:
• IF WE HAVE A->ΑΒ1 | ΑΒ2 THEN WE REPLACE IT WITH
• A -> ΑA’
• A’ -> Β1 | Β2
RICHA SHARMA (LOVELY PROFESSIONAL UNIVERSITY)
22
23. EXAMPLES OF LEFT FACTORING
1. S -> iEtS|iEtSES|a E ->b
2. S-> aSSbS|aSaSb|abb|b
3. S-> bSSaaS|bSSaSb|bSb|a
RICHA SHARMA (LOVELY PROFESSIONAL UNIVERSITY)
23