COMS W4115
Programming Languages and Translators
Lecture 5: Lexical Analysis, February 6, 2008
Lecture Outline
- Review
- The lexical analyzer
- Basic definitions from language theory
- Regular expressions
- Reading
1. Review
- Overview of compilation
- Front end: analysis
- Back end: synthesis
- IR: Intermediate representation(s)
- Phases
- lexical analyzer (scanner)
- syntax analyzer (parser)
- semantic analyzer
- intermediate code generator
- machine-independent code optimizer
- code generator
- machine-dependent code optimizer
- Symbol table
- Error handler
- Project teams should now be formed.
2. The Lexical Analyzer
- The first phase of the compiler is the lexical analyzer (lexer).
- It reads the stream of characters making up the source
program and groups the characters into logically meaningful sequences called lexemes.
- For each lexeme the lexer sends to the parser a token of the
form <token-name, attribute-value>.
- For a token such as an identifier, the lexer will make an entry into
the symbol table in which it stores attributes such as
the lexeme and type associated with the token.
- The lexer will also strip out whitespace and expand macros.
3. Basic Definitions from Language Theory
- Symbol (character, letter)
- Alphabet: a finite nonempty set of characters
- Examples: {0, 1}, ASCII, Unicode
- String (sentence, word): a finite sequence of characters, possibly empty.
- Language: a (countable) set of strings, possibly empty.
- Operations on strings
- concatenation
- exponentiation
- x0 is the empty string ε.
- xi = xi-1x, for i > 0
- prefix, suffix, substring, subsequence
- Operations on languages
- union
- concatenation
- exponentiation
- L0 is the empty set.
- Li = Li-1L, for i > 0
- Kleene closure
4. Regular Expressions
- A regular expression is a notation for specifying a set of strings.
- Many of today's programming languages use regular expressions to match
patterns in strings.
- E.g., awk, flex, lex, java, javascript, perl, python
- Today regular expressions come many different forms.
- The earliest and simplest are the Kleene regular expressions: See ALSU, Sect. 3.3.3.
- Awk and egrep extended grep's regular expressions with union and parenthesis.
- POSIX attempted to standardize Unix regular expressions.
- Perl has an amazingly rich set of regular expression operators.
- Python uses pcre regular expressions.
- Lex regular expressions
- The lexical analyzer generators flex and lex use extended regular expressions
to specify lexeme patterns making up tokens: See ALSU, Fig. 3.8, p. 127.
- See
The Lex & Yacc Page
for lex and flex tutorials and manuals.
5. Reading Assignment
- Read Chapter 3, all sections except 3.9.
aho@cs.columbia.edu