compilers-aho: update 03-lexical-analysis.md

- Recognition of Tokens

Prithu Goswami prithugoswami524@gmail.com

Thu, 02 Apr 2020 16:50:50 +0530

commit eb1c1110d68a759aa46142f0265ddd92439ce5e3

parent 40b4145dc908df57118b4daf581f5e2c25f62516

11 files changed, 104 insertions(+), 0 deletions(-)

M compilers-aho/03-lexical-analysis.md → compilers-aho/03-lexical-analysis.md

@@ -224,3 +224,107 @@ $d_1$ and $d_2$ in $r_3$ by $r_1$ and (the substituted) $r_2$ and so on.
 
 ![C identifiers regular definition example](img/cid-reg-definitions.png){width=40%}
 
+![Unsigned numbers (integer or floating point) are strings such as 5238, 0.0123, 6.33E4, or 1.89E-4](img/unsigned-number-example.png){width=60%}
+
+## Extensions of Regular Expressions
+
+These extensions enhance the ability of regular expressions to specify
+string patterns.
+
+1. **One or more instances**. The unary postfix operator $^+$ represents the
+   positive closure of a regular expression and its language. The relation
+   between Kleene closure and positive closure is $r^* = r^+ | \epsilon$ and
+   $r^+ = rr^* = r^*r$.
+2. **Zero or one instance**. The unary postfix operator $?$ means "zero or one
+   occurrence". $r?$ is equivalent to $r|\epsilon$.
+3. **Character classes**. A regular expression $a_1 | a_2 | ... | a_n$ can be
+   replaced by the shorthand $[a_1 a_2 ... a_n]$. When $a_1, a_2, ..., a_n$
+   form a logical sequence, such as consecutive letters or digits, they can be
+   replaced by $[a_1 - a_n]$. For example, [abc] is shorthand for a|b|c, and
+   [a-z] is shorthand for a|b|c|...|z.
+
+**Simplification of the previous two examples**:
+
+![](img/simplificiation-cid.png){width=35%}
+
+![](img/simplification-number.png){width=45%}
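As a rough check of the simplified definitions, here is a minimal sketch using Python's `re` module, whose `+`, `?` and `[...]` operators behave like the extensions above. The exact patterns are an assumption reconstructed from the unsigned-number examples (5238, 0.0123, 6.33E4, 1.89E-4) and the usual shape of C identifiers, since the figures themselves are images. The names `NUMBER`, `IDENT`, `is_number` and `is_identifier` are made up for illustration.

```python
import re

# Assumed regular definition for unsigned numbers, using +, ? and classes:
#   digits -> [0-9]+
#   number -> digits (. digits)? (E [+-]? digits)?
NUMBER = re.compile(r"[0-9]+(\.[0-9]+)?(E[+-]?[0-9]+)?")

# Assumed definition for C identifiers: a letter or underscore followed by
# any number of letters, digits or underscores.
IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_number(s):
    """True if the whole string s matches the unsigned-number pattern."""
    return NUMBER.fullmatch(s) is not None


def is_identifier(s):
    """True if the whole string s matches the identifier pattern."""
    return IDENT.fullmatch(s) is not None
```

With these patterns, all four example strings from the caption are accepted as numbers, while strings such as `1.` or `.5` are rejected because the fraction part requires digits on both sides of the dot.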
+
+\pagebreak
+
+# Recognition of Tokens
+
+**Example: Grammar for branching statements**
+
+![](img/grammar-brancing-stmts.png){width=42%}
+
+The terminals **if, then, else, relop, id** and **number** are the names of
+the tokens, and their patterns are described using the regular definitions below.
+
+**Patterns for tokens**
+
+![](img/patterns-tokens.png){width=55%}
+
+**Tokens, their patterns and attribute values for the example**
+
+![](img/table-patterns-tokens.png){width=60%}
+
+## Transition Diagrams
+
+We make transition diagrams from the regular expressions. Some conventions
+about transition diagrams:
+
+1. Certain states are said to be *accepting*, or *final*. These states indicate
+   that a lexeme has been found, although the actual lexeme may not consist of
+   all positions between the *lexemeBegin* and *forward* pointers. We always
+   indicate an accepting state by a double circle, and if there is an action to
+   be taken -- typically returning a token and an attribute value to the parser
+   -- we shall attach that action to the accepting state.
+2. In addition, if it is necessary to retract the *forward* pointer one
+   position (i.e., the lexeme does not include the symbol that got us to the
+   accepting state), then we shall additionally place a $*$ near that
+   accepting state. If several retractions are required, we can attach any
+   number of $*$'s to the accepting state.
+3. One state is designated the *start state*, or *initial state*; it is
+   indicated by an edge, labeled "start," entering from nowhere. The transition
+   diagram always begins in the start state before any input symbols have been
+   read.
+
+![Transition diagram for **relop**](img/relop-transition.png){width=56%}
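The conventions above can be sketched as a small state-driven function. This is a minimal sketch, assuming the relational operators are `<`, `<=`, `<>`, `=`, `>` and `>=`. The state numbers, the attribute names (`LT`, `LE`, ...) and the function name `get_relop` are illustrative assumptions, not read off the figure. The two returns that use `forward - 1` correspond to accepting states marked with a $*$.

```python
def get_relop(text, pos):
    """Simulate a relop transition diagram starting at text[pos].

    Returns (attribute, next_pos) if a relational operator starts at pos,
    or None if the diagram fails on the first character.
    """
    state = 0
    forward = pos
    while True:
        # None plays the role of "other", which includes end of input.
        c = text[forward] if forward < len(text) else None
        forward += 1
        if state == 0:                      # start state
            if c == "<":
                state = 1
            elif c == "=":
                return ("EQ", forward)
            elif c == ">":
                state = 6
            else:
                return None                 # not a relop at this position
        elif state == 1:                    # seen "<"
            if c == "=":
                return ("LE", forward)
            if c == ">":
                return ("NE", forward)
            return ("LT", forward - 1)      # starred state: retract one position
        elif state == 6:                    # seen ">"
            if c == "=":
                return ("GE", forward)
            return ("GT", forward - 1)      # starred state: retract one position
```

For the input `<a`, the function reads `a` to decide that the operator is `<`, then retracts so that the returned position still points at `a`.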
+
+\pagebreak
+
+## Recognition of Reserved Words and Identifiers
+
+There are two methods discussed:
+
+1. We can install the reserved words in the symbol table before scanning
+   begins, and use a single transition diagram for both **id** and the
+   reserved words, since their lexemes have the same form. When the diagram
+   accepts a lexeme, we consult the symbol table. If the lexeme is already
+   there as a reserved word, the entry tells us which token it is. If it is an
+   ordinary identifier that was seen before, the token name stored in the
+   entry is **id**, recorded the first time that lexeme was encountered.
+   Otherwise the lexeme is entered as a new **id**.
+
+![Transition diagram for **id** and keyword](img/recognition-id-res-1.png){width=72%}
+
+2. We can create a separate transition diagram for each keyword. Each diagram
+   checks the characters of its keyword one at a time, moving to a new state
+   after each match.
+
+![Hypothetical transition diagram for the keyword `then`](img/recognition-id-res-2.png){width=80%}
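The first method can be sketched with a dictionary standing in for the symbol table. This is a minimal sketch under stated assumptions: the table maps a lexeme directly to its token name, and the names `symbol_table`, `install_id`, `get_token` and `classify` are hypothetical helpers chosen for illustration.

```python
# Hypothetical symbol table: lexeme -> token name.
# The reserved words are installed before any input is scanned.
symbol_table = {"if": "if", "then": "then", "else": "else"}


def install_id(lexeme):
    """Enter lexeme as an id unless it is already in the table."""
    symbol_table.setdefault(lexeme, "id")
    return lexeme


def get_token(lexeme):
    """Return the token name recorded in the table for lexeme."""
    return symbol_table[lexeme]


def classify(lexeme):
    """Action taken when the shared id/keyword diagram accepts a lexeme."""
    install_id(lexeme)
    return get_token(lexeme)
```

Because `setdefault` never overwrites an existing entry, a reserved word keeps its own token name, and an identifier seen a second time simply finds the **id** entry made on its first occurrence.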
+
+
+## Completion of the Running Example
+
+We can now make a transition diagram for the **number** token in the same way.
+
+![](img/transition-number.png){width=85%}
+
+
+
+## Architecture of a Transition-Diagram-Based Lexical Analyzer
+
+Transition diagrams can be implemented as switch statements (page 158):
+
+- We can run the transition diagrams in parallel, try them one by one in
+  sequence, or combine all of them into a single transition diagram.
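The "one by one in sequence" option can be sketched as a loop over recognizers. This is a minimal sketch, not the book's code: each recognizer stands in for one transition diagram, and Python regular expressions are swapped in for hand-written switch statements to keep it short. The names `make_recognizer`, `DIAGRAMS`, `KEYWORDS` and `tokenize`, and the exact patterns, are assumptions made for illustration.

```python
import re


def make_recognizer(token, pattern):
    """Build a stand-in for one transition diagram.

    The recognizer takes the input and the lexemeBegin position and returns
    (token, end) on success or None on failure.
    """
    regex = re.compile(pattern)

    def recognize(text, begin):
        m = regex.match(text, begin)
        return (token, m.end()) if m else None

    return recognize


# Diagrams are tried one by one, in this order.
DIAGRAMS = [
    make_recognizer("relop", r"<=|<>|>=|<|>|="),
    make_recognizer("number", r"[0-9]+(\.[0-9]+)?(E[+-]?[0-9]+)?"),
    make_recognizer("id", r"[A-Za-z][A-Za-z0-9]*"),
]
KEYWORDS = {"if", "then", "else"}


def tokenize(text):
    """Return a list of (token, lexeme) pairs for text."""
    tokens = []
    begin = 0
    while begin < len(text):
        if text[begin].isspace():           # skip whitespace between lexemes
            begin += 1
            continue
        for recognize in DIAGRAMS:
            # Every diagram starts again from lexemeBegin.
            result = recognize(text, begin)
            if result:
                token, end = result
                lexeme = text[begin:end]
                if token == "id" and lexeme in KEYWORDS:
                    token = lexeme          # reserved word, not an ordinary id
                tokens.append((token, lexeme))
                begin = end
                break
        else:
            raise ValueError(f"no diagram accepts input at position {begin}")
    return tokens
```

When a recognizer fails, the next one starts again from the same `begin` position, which mirrors resetting the *forward* pointer to *lexemeBegin* before trying the next diagram.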

A compilers-aho/img/grammar-brancing-stmts.png

A compilers-aho/img/patterns-tokens.png

A compilers-aho/img/recognition-id-res-1.png

A compilers-aho/img/recognition-id-res-2.png

A compilers-aho/img/relop-transition.png

A compilers-aho/img/simplification-number.png

A compilers-aho/img/simplificiation-cid.png

A compilers-aho/img/table-patterns-tokens.png

A compilers-aho/img/transition-number.png

A compilers-aho/img/unsigned-number-example.png
