Understanding
Programming Language
Part 1

IDE Development Course

Andrew Vasilyev

Today's Agenda

  • Finite State Automatons
  • Formal languages
  • Abstract Syntax Trees
  • Lexical Analysis
  • Syntax Analysis

Programming Language Processing in IDE

						%%{
							init: {
								'theme': 'base',
								'themeVariables': {
									'fontSize': '35px',
									'darkmode': true,
									'lineColor': '#F8B229'
								}
							}
						}%%
						graph LR
						A[Code] --> B[???]
						B --> C[PSI]
					

Finite State Automatons (FSA)

An FSA is a mathematical model of computation. It represents a system with a finite number of states, transitions between those states, a starting state, and one or more accepting (final) states. FSAs are used in parsing, lexical analysis, and many other tasks.

Finite State Automatons (FSA)

  • States: Different configurations the FSA can be in.
  • Transitions: Movements between states based on inputs.
  • Inputs & Outputs: Stream of symbols the FSA processes and the results.
								%%{
									init: {
										'theme': 'base',
										'themeVariables': {
											'fontSize': '24px',
											'darkmode': true,
											'lineColor': '#F8B229'
										}
									}
								}%%
								graph TD
								A((start)) -->|c| B[c]
								B -->|h| C[ch]
								B -->|a| H[ca]
								C -->|a| D[cha]
								C -->|e| E[che]
								D -->|t| G(chat)
								E -->|a| F[chea]
								F -->|t| I(cheat)
								H -->|t| J(cat)	
								G --> Z(end)
								I --> Z
								J --> Z						
							
"chat", "cheat" or "cat"?

Implementing FSAs - switch/when

					
						// Kotlin does not allow local enum classes, so State is declared at top level.
						enum class State {
							START, STATE1, STATE2, STATE3, STATE4, STATE5, STATE6, END
						}

						fun recognize(input: String): Boolean {
							var currentState = State.START
							for (char in input) {
								currentState = when (currentState) {
									State.START -> when (char) {
										'c' -> State.STATE1
										else -> return false
									}
									State.STATE1 -> when (char) {
										'h' -> State.STATE2
										'a' -> State.STATE3
										else -> return false
									}
									State.STATE2 -> when (char) {
										'a' -> State.STATE4
										'e' -> State.STATE5
										else -> return false
									}
									State.STATE3 -> when (char) {
										't' -> State.END
										else -> return false
									}
									State.STATE4 -> when (char) {
										't' -> State.END
										else -> return false
									}
									State.STATE5 -> when (char) {
										'a' -> State.STATE6
										else -> return false
									}
									State.STATE6 -> when (char) {
										't' -> State.END
										else -> return false
									}
									State.END -> return false
								}
							}
							return currentState == State.END
						}	
					
				

Implementing FSAs - recursive descent

					
						fun recognize(input: String): Boolean {
							val chars = input.toCharArray()
							var position = 0
						
							fun accept(c: Char): Boolean {
								if (position < chars.size && chars[position] == c) {
									position++
									return true
								}
								return false
							}
						
							// Local functions must be declared before they are used,
							// so the states are defined from the last one back to the start.
							fun state6(): Boolean {
								return accept('t')
							}
						
							fun state5(): Boolean {
								return accept('a') && state6()
							}
						
							fun state4(): Boolean {
								return accept('t')
							}
						
							fun state3(): Boolean {
								return accept('t')
							}
						
							fun state2(): Boolean {
								return when {
									accept('a') -> state4()
									accept('e') -> state5()
									else -> false
								}
							}
						
							fun state1(): Boolean {
								return when {
									accept('h') -> state2()
									accept('a') -> state3()
									else -> false
								}
							}
						
							fun stateStart(): Boolean {
								return accept('c') && state1()
							}
						
							return stateStart() && position == chars.size
						}
						
						fun main() {
							println(recognize("chat"))   // true
							println(recognize("cheat"))  // true
							println(recognize("cat"))    // true
						}
					
				

Implementing FSAs - State pattern

					
						interface State {
							fun next(c: Char): State
							fun isEnd(): Boolean = false
						}
						
						// Absorbing state for invalid input: once entered, the string can never be accepted.
						object ErrorState : State {
							override fun next(c: Char): State = this
						}
						
						object StartState : State {
							override fun next(c: Char): State = if (c == 'c') State1 else ErrorState
						}
						
						object State1 : State {
							override fun next(c: Char): State = when (c) {
								'h' -> State2
								'a' -> State3
								else -> ErrorState
							}
						}
						
						object State2 : State {
							override fun next(c: Char): State = when (c) {
								'a' -> State4
								'e' -> State5
								else -> ErrorState
							}
						}
						
						object State3 : State {
							override fun next(c: Char): State = if (c == 't') EndState else ErrorState
						}
						
						object State4 : State {
							override fun next(c: Char): State = if (c == 't') EndState else ErrorState
						}
						
						object State5 : State {
							override fun next(c: Char): State = if (c == 'a') State6 else ErrorState
						}
						
						object State6 : State {
							override fun next(c: Char): State = if (c == 't') EndState else ErrorState
						}
						
						object EndState : State {
							// Any character after a complete word makes the input invalid.
							override fun next(c: Char): State = ErrorState
							override fun isEnd() = true
						}
						
						fun recognize(input: String): Boolean {
							var currentState: State = StartState
							for (char in input) {
								currentState = currentState.next(char)
							}
							return currentState.isEnd()
						}
					
				

Implementing FSAs - state transition table

					
						| Current State | 'c'  | 'h'  | 'a'  | 'e'  | 't'  | any other |
						|---------------|------|------|------|------|------|-----------|
						| StartState    | S1   | -    | -    | -    | -    | -         |
						| S1            | -    | S2   | S3   | -    | -    | -         |
						| S2            | -    | -    | S4   | S5   | -    | -         |
						| S3            | -    | -    | -    | -    | End  | -         |
						| S4            | -    | -    | -    | -    | End  | -         |
						| S5            | -    | -    | S6   | -    | -    | -         |
						| S6            | -    | -    | -    | -    | End  | -         |
						| EndState      | -    | -    | -    | -    | -    | -         |
					
				

Implementing FSAs - state transition table

					
						enum class State {
							StartState, S1, S2, S3, S4, S5, S6, EndState
						}
						
						fun recognize(input: String): Boolean {
							val transitionTable: Map<State, Map<Char, State>> = mapOf(
								State.StartState to mapOf('c' to State.S1),
								State.S1 to mapOf('h' to State.S2, 'a' to State.S3),
								State.S2 to mapOf('a' to State.S4, 'e' to State.S5),
								State.S3 to mapOf('t' to State.EndState),
								State.S4 to mapOf('t' to State.EndState),
								State.S5 to mapOf('a' to State.S6),
								State.S6 to mapOf('t' to State.EndState)
							)
						
							var currentState = State.StartState
							for (char in input) {
								currentState = transitionTable[currentState]?.get(char) ?: return false
							}
							return currentState == State.EndState
						}				
					
				

Formal Languages and Regular Grammars

Definitions

Formal languages provide a structured way to represent and process information.

A formal language is a set of strings over a fixed alphabet of symbols. The rules that determine which strings belong to the language are given by a grammar.

Consider a formal language over the alphabet {0, 1} that consists of all strings where every 0 is immediately followed by a 1. Examples of strings in this language include "1", "11", "101", and "1101". Strings like "10", "100", and "1100" are not in this language.
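
As a minimal sketch, membership in this example language can be checked directly from its definition. The function name `isInLanguage` is illustrative and not part of the course material.

```kotlin
// Checks whether a string over {0, 1} belongs to the language
// "every 0 is immediately followed by a 1".
fun isInLanguage(s: String): Boolean {
    for (i in s.indices) {
        when (s[i]) {
            '1' -> {} // a 1 is always allowed
            '0' -> if (i + 1 >= s.length || s[i + 1] != '1') return false
            else -> return false // symbol outside the alphabet {0, 1}
        }
    }
    return true
}
```

For instance, `isInLanguage("1101")` returns true, while `isInLanguage("100")` returns false.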

Definitions

String - a finite sequence of symbols chosen from a set called an alphabet. For instance, "1101" is a string over the alphabet {0, 1}.

Alphabet - a finite set of symbols. These symbols are the basic building blocks that can be used to construct strings in a formal language. For example, the binary alphabet is {0, 1}.

Symbol - a basic unit in an alphabet. It's an individual character or element that can be used to construct strings. In the English alphabet, "A", "B", and "C" are symbols. In the binary alphabet, 0 and 1 are symbols.

Formal grammar

Formal grammar is a set of production rules for strings in a formal language. The grammar describes how to form strings from the language's alphabet that are valid according to the language's syntax. These rules determine how symbols can be combined to produce valid strings.

A grammar is said to be "context-free" if every production rule has a single non-terminal on its left-hand side, so a rule can be applied regardless of the symbols surrounding that non-terminal. This is in contrast to context-sensitive grammars, where the application of rules can depend on the surrounding symbols.

Grammar of formal language - Example

S → 1S | 01S | ε

S is a non-terminal symbol.
1 and 0 are terminal symbols; ε denotes the empty string.

The grammar says that the non-terminal S can be replaced by:
A "1" followed by another string derived from S.
A "01" followed by another string derived from S.
The empty string ε.

With these rules, we can derive strings like "1", "11", "101", "1101", etc., but not strings like "10" or "100".
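
The derivation can be sketched as a recognizer in which each branch corresponds to one production of S. The name `derivesFromS` is an assumption made for this illustration.

```kotlin
// Recognizer that follows the grammar S → 1S | 01S | ε.
// `position` is the index of the first symbol that still has to be derived.
tailrec fun derivesFromS(input: String, position: Int = 0): Boolean = when {
    position == input.length -> true                                        // S → ε
    input[position] == '1' -> derivesFromS(input, position + 1)             // S → 1S
    input.startsWith("01", position) -> derivesFromS(input, position + 2)   // S → 01S
    else -> false                                                           // no production applies
}
```

Because every production consumes input from the left and ends with S, the recognizer never needs to backtrack, which is what makes this grammar regular.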

								%%{
									init: {
										'theme': 'base',
										'themeVariables': {
											'fontSize': '24px',
											'darkmode': true,
											'lineColor': '#F8B229'
										}
									}
								}%%
								graph TD
								A((start)) -->|1| A
								A -->|0| B
								B -->|1| A	
								B -->|0| C[error]
								C -->|1| C
								C -->|0| C					
							

Lexical Analysis

Example

T={ a-z , 0-9 }
S → Identifier | Constant
Identifier → Char | Char AlphaNums
Constant → Digits
Digits → Digit | Digit Digits
AlphaNums → CharOrDigit | CharOrDigit AlphaNums
CharOrDigit → Char | Digit
Char → a-z
Digit → 0-9

								%%{
									init: {
										'theme': 'base',
										'themeVariables': {
											'fontSize': '24px',
											'darkmode': true,
											'lineColor': '#F8B229'
										}
									}
								}%%
								graph TD
									Start((Start)) -->|a-z| IdentifierStart
									Start -->|0-9| ConstantStart
									IdentifierStart -->|a-z, 0-9| IdentifierContinue
									IdentifierContinue -->|a-z, 0-9| IdentifierContinue
									ConstantStart -->|0-9| ConstantContinue
									ConstantContinue -->|0-9| ConstantContinue				
							

Lexical Analyzer

Tokens are sequences of characters with a collective meaning. Examples include keywords, identifiers, operators, and literals.

Lexical analysis is the process of translating an input stream of characters into a stream of meaningful tokens.

A lexical analyzer (or lexer) is an automaton that recognizes tokens in a stream of characters and produces a stream of tokens.

To implement a lexer:
  • Define a grammar that describes the tokens.
  • Build an FSA that takes a stream of characters as input and outputs a stream of tokens.

Example

					
						sealed class Token
						data class IdentifierToken(val name: String) : Token()
						data class ConstantToken(val value: Int) : Token()
						
						class Lexer(private val input: String) {
							private var currentIndex = 0
						
							fun tokenize(): List<Token> {
								val tokens = mutableListOf<Token>()
								skipWhitespace()
								while (currentIndex < input.length) {
									when (val char = peek()) {
										in '0'..'9' -> tokens.add(constant())
										in 'a'..'z' -> tokens.add(identifier())
										else -> throw IllegalArgumentException("Unexpected character '$char' at position $currentIndex.")
									}
									skipWhitespace()
								}
								return tokens
							}
						
							private fun peek(): Char = if (currentIndex < input.length) input[currentIndex] else 0.toChar()
							private fun nextChar(): Char = if (currentIndex < input.length) input[currentIndex++] else 0.toChar()
							private fun skipWhitespace() {
								while (currentIndex < input.length && peek().isWhitespace()) {
									currentIndex++
								}
							}
						
							private fun constant(): ConstantToken {
								val sb = StringBuilder()
								while (currentIndex < input.length && Character.isDigit(peek())) {
									sb.append(nextChar())
								}
								return ConstantToken(sb.toString().toInt())
							}
						
							private fun identifier(): IdentifierToken {
								val sb = StringBuilder()
								while (currentIndex < input.length && (Character.isDigit(peek()) || Character.isLowerCase(peek()))) {
									sb.append(nextChar())
								}
								return IdentifierToken(sb.toString())
							}
						}					
					
				

Syntax Analysis

Parse Tree (or concrete syntax tree)

A parse tree, also known as a concrete syntax tree, is a tree representation that reflects the syntactic structure of a string according to some formal grammar.

Parse Tree (or concrete syntax tree)

E → E + E
E → E * E
E → ( E )
E → number

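The rules are recursive (E appears on both sides), so the grammar is context-free but not regular, and a plain FSA cannot recognize it. The grammar is also ambiguous: without precedence rules, 3 + 2 * 4 can be parsed in more than one way.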

								%%{
									init: {
										'theme': 'base',
										'themeVariables': {
											'fontSize': '24px',
											'darkmode': true,
											'lineColor': '#F8B229'
										}
									}
								}%%
								graph TD
								E1("(E)")
								E2("(E)")
								PLUS["+"]
								E3("(E)")
								E4("(E)")
								TIMES[("*")]
								E5("(E)")
								N1["3"]
								N2["2"]
								N3["4"]
							
								E1 --> PLUS
								E1 --> E2
								E1 --> E3
								E3 --> TIMES
								E3 --> E4
								E3 --> E5
								E2 --> N1
								E4 --> N2
								E5 --> N3	
							

Abstract Syntax Tree

An Abstract Syntax Tree (AST) is a hierarchical tree representation of the abstract syntactic structure of a string in a formal language. Each node of the tree denotes a construct occurring in the string.

Unlike a parse tree, an AST abstracts away many of the specific syntactic details.

ASTs are used in the syntax analysis phase of compilers to simplify the structure of the source code, omitting certain syntactic elements to focus more on the semantic structure of the code.
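
A minimal sketch of how AST nodes might be modeled follows. The type names `Expr`, `Num`, `Add`, and `Mul` are assumptions for illustration. Note that the parentheses of the source text `(1 + 2) * 3` do not appear as nodes: the grouping is carried by the shape of the tree.

```kotlin
// AST node types for arithmetic expressions (illustrative names).
sealed class Expr
data class Num(val value: Int) : Expr()
data class Add(val left: Expr, val right: Expr) : Expr()
data class Mul(val left: Expr, val right: Expr) : Expr()

// Walks the tree bottom-up and computes the value of the expression.
fun evaluate(e: Expr): Int = when (e) {
    is Num -> e.value
    is Add -> evaluate(e.left) + evaluate(e.right)
    is Mul -> evaluate(e.left) * evaluate(e.right)
}

// AST for the source text "(1 + 2) * 3".
val example: Expr = Mul(Add(Num(1), Num(2)), Num(3))
```

Since the tree already encodes grouping and precedence, later phases such as `evaluate` never need to look at the original text again.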

Abstract Syntax Tree - Example

E → E + E
E → E * E
E → ( E )
E → number

								%%{
									init: {
										'theme': 'base',
										'themeVariables': {
											'fontSize': '24px',
											'darkmode': true,
											'lineColor': '#F8B229'
										}
									}
								}%%
								graph TD
								A[+]
								B[3]
								C[*]
								D[2]
								E[4]
								A --> B
								A --> C
								C --> D
								C --> E			
							

How to build Abstract Syntax Tree?

A basic FSA has no memory beyond its current state, so it cannot keep track of nesting (for example, how many parentheses are still open) and cannot select the right transition!

How to build Abstract Syntax Tree?

Add a stack to the FSA and use it to keep track of the current position in the AST while building it.

An FSA (Finite State Automaton) augmented with a stack is called a Pushdown Automaton (PDA). The stack gives the automaton memory, allowing it to recognize context-free languages, which are a superset of regular languages. The stack can be pushed to, popped from, or inspected, and these operations can influence the transitions between states in the PDA.

The name "pushdown" comes from the way the stack is used: data is "pushed down" onto the stack and later "popped up" from the stack. The presence of the stack allows a PDA to remember an unbounded amount of information, which is crucial for recognizing languages that have nested or recursive structures.
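
As a small illustration of why the stack matters, the sketch below recognizes balanced parentheses and brackets, a nested language that no FSA can recognize. The function name `isBalanced` is chosen for this example only.

```kotlin
// Recognizes strings of correctly nested '(' ')' and '[' ']'.
// The stack remembers which closing symbols are still expected.
fun isBalanced(input: String): Boolean {
    val stack = ArrayDeque<Char>()
    for (c in input) {
        when (c) {
            '(', '[' -> stack.addLast(c)                                  // push the opener
            ')' -> if (stack.removeLastOrNull() != '(') return false      // pop and compare
            ']' -> if (stack.removeLastOrNull() != '[') return false
            else -> return false                                          // symbol outside the alphabet
        }
    }
    // Accept only if every opener has been closed.
    return stack.isEmpty()
}
```

The automaton itself has a single state here; all of the memory lives in the stack.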

Syntax Analyzer

A syntax analyzer, commonly known as a parser, is a component in the compilation process responsible for taking a stream of tokens (typically produced by a lexical analyzer) and determining if they form a syntactically correct sequence. It also often constructs a representation of the structure of that sequence, commonly in the form of a parse tree or an abstract syntax tree (AST).

Top-down parsing starts from the root (the start symbol) and constructs the parse tree down to the leaves (the tokens). It starts with the start symbol and tries to rewrite it to match the input. In the process, it predicts which production to use and then matches the production against the input.

Recursive Descent Parsing is a kind of top-down parsing where each non-terminal in the grammar is implemented as a function. The parser tries to apply the productions to predict and match the input.

Example

					
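						// Token and AST node types assumed by this example.
						sealed class Token
						data class NumberToken(val value: Int) : Token()
						data class OpToken(val op: Char) : Token()
						object LeftParenToken : Token()
						object RightParenToken : Token()
						
						sealed class ExprNode
						data class NumberNode(val value: Int) : ExprNode()
						data class BinOpNode(val left: ExprNode, val op: Char, val right: ExprNode) : ExprNode()
						
						class RecursiveDescentParser(private val tokens: List<Token>) {
							private var currentTokenIndex = 0
						
							fun parse(): ExprNode {
								val result = expression()
								expectEndOfInput()
								return result
							}
						
							// expression → term ('+' term)*
							private fun expression(): ExprNode {
								var node = term()
								while (peekOp() == '+') {
									val op = nextOp()
									node = BinOpNode(node, op, term())
								}
								return node
							}
						
							// term → factor ('*' factor)*, so '*' binds tighter than '+'
							private fun term(): ExprNode {
								var node = factor()
								while (peekOp() == '*') {
									val op = nextOp()
									node = BinOpNode(node, op, factor())
								}
								return node
							}
						
							// factor → number | '(' expression ')'
							private fun factor(): ExprNode {
								return when (val token = nextToken()) {
									is LeftParenToken -> {
										val node = expression()
										consumeToken(RightParenToken)
										node
									}
									is NumberToken -> NumberNode(token.value)
									else -> throw IllegalArgumentException("Unexpected token: $token")
								}
							}
						
							// Returns null at the end of input instead of indexing past the list.
							private fun peekToken(): Token? = tokens.getOrNull(currentTokenIndex)
							private fun nextToken(): Token = tokens.getOrNull(currentTokenIndex++)
								?: throw IllegalArgumentException("Unexpected end of input")
							private fun peekOp(): Char? = (peekToken() as? OpToken)?.op
							private fun nextOp(): Char = (nextToken() as OpToken).op
						
							private fun consumeToken(expected: Token) {
								if (peekToken() == expected) {
									currentTokenIndex++
								} else {
									throw IllegalArgumentException("Expected token: $expected, but found: ${peekToken()}.")
								}
							}
							private fun expectEndOfInput() {
								if (currentTokenIndex != tokens.size) {
									throw IllegalArgumentException("Unexpected token at position $currentTokenIndex: ${tokens[currentTokenIndex]}")
								}
							}
						}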
					
				

IDE Challenges

Handling Lexical Errors

If a lexer encounters an unidentifiable sequence, it's termed a 'lexical error'. While compilers might halt or flag an error at this point, an IDE needs to gracefully handle the text that follows.

Recommended Actions:
  • Abort Recognition: Terminate the current lexeme's processing.
  • Skip to Validity: Bypass the subsequent characters until a recognizable sequence is found, ensuring the lexer is always ready for the next valid input.

A user-friendly solution could also involve generating a specific 'Lexer Error' token, which can then be highlighted in the editor, giving developers a clear visual cue of the issue.
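
The recommendations above can be sketched as a lexer that never throws. The token types and the function `tokenizeWithErrors` are assumptions for this illustration; they cover only lowercase identifiers and decimal constants.

```kotlin
sealed class Token
data class IdentifierToken(val name: String) : Token()
data class ConstantToken(val text: String) : Token()
// Carries the unrecognized text and its offset so the editor can highlight it.
data class ErrorToken(val text: String, val offset: Int) : Token()

// A character at which the lexer can resume normal recognition.
fun isTokenStart(c: Char): Boolean = c.isWhitespace() || c in '0'..'9' || c in 'a'..'z'

fun tokenizeWithErrors(input: String): List<Token> {
    val tokens = mutableListOf<Token>()
    var i = 0
    while (i < input.length) {
        val start = i
        val c = input[i]
        when {
            c.isWhitespace() -> i++
            c in '0'..'9' -> {
                while (i < input.length && (input[i] in '0'..'9')) i++
                tokens.add(ConstantToken(input.substring(start, i)))
            }
            c in 'a'..'z' -> {
                while (i < input.length && (input[i] in 'a'..'z' || input[i] in '0'..'9')) i++
                tokens.add(IdentifierToken(input.substring(start, i)))
            }
            else -> {
                // Skip to validity: consume characters until a recognizable one is found,
                // then emit everything skipped as a single error token.
                while (i < input.length && !isTokenStart(input[i])) i++
                tokens.add(ErrorToken(input.substring(start, i), start))
            }
        }
    }
    return tokens
}
```

For the input `abc #@ 42`, this sketch yields an identifier, one error token covering `#@` at offset 4, and a constant, so the tokens after the error are still available to the rest of the IDE.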

Incremental Lexical Analysis

This involves analyzing only the parts of a code file that have changed, rather than the entire file.

How to implement:
  • Save each token with its start and end offsets in the code.
  • When the code changes:
    • Remove the tokens in the changed part.
    • Update the positions of the other tokens.
    • Run the analysis again, starting from the first changed part.

Possible challenges:
  • If Only a Small Part of a Token Changes: Start the analysis from the beginning of that token.
  • When Comments or Strings Change: The opening and closing delimiters of these tokens are linked, so editing one of them can change how the text between them (or after them) is tokenized. Take this link into account when analyzing again.
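
A simplified sketch of this idea follows. It assumes a toy lexer in which tokens are runs of non-whitespace characters, and it re-lexes from the first affected token to the end of the text; a production implementation would stop as soon as the new tokens line up with the old ones and shift the remaining offsets instead. All names here are illustrative.

```kotlin
data class Token(val text: String, val start: Int) {
    val end: Int get() = start + text.length
}

// Toy lexer: a token is a maximal run of non-whitespace characters.
fun lex(text: String, from: Int = 0): List<Token> {
    val tokens = mutableListOf<Token>()
    var i = from
    while (i < text.length) {
        if (text[i].isWhitespace()) { i++; continue }
        val start = i
        while (i < text.length && !text[i].isWhitespace()) i++
        tokens.add(Token(text.substring(start, i), start))
    }
    return tokens
}

// Re-lexes `newText` after an edit that begins at offset `editStart`.
fun relex(oldTokens: List<Token>, newText: String, editStart: Int): List<Token> {
    // Keep tokens that end strictly before the edit. A token that touches the
    // edit position may have changed, so it is analyzed again from its beginning.
    val kept = oldTokens.takeWhile { it.end < editStart }
    val restartAt = kept.lastOrNull()?.end ?: 0
    return kept + lex(newText, restartAt)
}
```

For example, after inserting `X` at offset 5 of `foo bar baz`, `relex` reuses the token `foo` and only re-analyzes the text from offset 3 onward.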

Syntax Error Recovery Strategies

Panic Mode Recovery

When an error is detected, skip tokens until a specific "synchronizing" token is found. This often is a token that marks the end of a statement (e.g., a semicolon in many languages or a newline in Python).
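
A minimal sketch of panic mode recovery, assuming a toy language of `identifier = number ;` statements over string tokens. The class `PanicModeParser` and its members are hypothetical names for this example.

```kotlin
data class Assignment(val name: String, val value: Int)

class PanicModeParser(private val tokens: List<String>) {
    private var pos = 0
    val errors = mutableListOf<String>()

    fun parseProgram(): List<Assignment> {
        val statements = mutableListOf<Assignment>()
        while (pos < tokens.size) {
            val start = pos
            try {
                statements.add(parseAssignment())
            } catch (e: IllegalStateException) {
                errors.add("Statement starting at token $start: ${e.message}")
                synchronize()
            }
        }
        return statements
    }

    private fun peek(): String? = tokens.getOrNull(pos)

    // assignment → identifier '=' number ';'
    // A token is consumed only after it has been checked, so the offending
    // token is still available to synchronize().
    private fun parseAssignment(): Assignment {
        val name = peek()?.takeIf { it.all(Char::isLetter) } ?: error("expected identifier, found ${peek()}")
        pos++
        if (peek() != "=") error("expected '=', found ${peek()}")
        pos++
        val value = peek()?.toIntOrNull() ?: error("expected number, found ${peek()}")
        pos++
        if (peek() != ";") error("expected ';', found ${peek()}")
        pos++
        return Assignment(name, value)
    }

    // Panic mode: discard tokens up to and including the synchronizing ';'.
    private fun synchronize() {
        while (pos < tokens.size && tokens[pos] != ";") pos++
        if (pos < tokens.size) pos++
    }
}
```

With the tokens of `x = 1 ; y = = ; z = 3 ;`, the parser reports one error for the second statement and still returns the assignments to `x` and `z`.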

Phrase Level Recovery

For some grammar rules, we can predict the likely ending (like a semicolon for an assignment). If an unexpected token is encountered, insert the predicted one. However, this may lead to additional syntax errors that don't exist in the actual code.
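
A sketch of this strategy for the same kind of toy `identifier = number ;` statements: when the terminating semicolon is missing, the parser records an error and continues as if it had been present. The names `Statement`, `ParseResult`, and `parseAssignments` are assumptions, and this sketch recovers only from a missing `;`.

```kotlin
data class Statement(val name: String, val value: Int)
data class ParseResult(val statements: List<Statement>, val errors: List<String>)

fun parseAssignments(tokens: List<String>): ParseResult {
    val statements = mutableListOf<Statement>()
    val errors = mutableListOf<String>()
    var pos = 0
    while (pos < tokens.size) {
        // Any malformed input other than a missing ';' aborts parsing in this sketch.
        val name = tokens[pos]
        require(tokens.getOrNull(pos + 1) == "=") { "expected '=' after $name" }
        val value = requireNotNull(tokens.getOrNull(pos + 2)?.toIntOrNull()) { "expected a number" }
        pos += 3
        if (tokens.getOrNull(pos) == ";") {
            pos++
        } else {
            // Phrase-level recovery: behave as if the predicted ';' had been present.
            errors.add("Missing ';' inserted before token index $pos")
        }
        statements.add(Statement(name, value))
    }
    return ParseResult(statements, errors)
}
```

For the tokens of `x = 1 y = 2 ;`, both statements are kept and a single error is reported, whereas skipping to the next semicolon would have discarded the second statement.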

Error Productions

If certain common errors are known, they can be represented in the grammar rules. This allows the parser to produce AST nodes for these errors, and later during analysis, the IDE can suggest fixes for them.

Incremental Syntax Analysis

In an Integrated Development Environment (IDE), it's not feasible to rebuild the entire Abstract Syntax Tree (AST) for every minor code change. The solution is to rebuild only the modified part of the tree.

How to implement:
  1. Remember which part of the text each AST node corresponds to.
  2. On a text change, find the node that encompasses the modified part.
  3. Reanalyze this section and construct a new subtree.
  4. Replace the old subtree with the new one.
  5. Update position data in the affected nodes.
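
Steps 2 and 4 can be sketched as follows, assuming each node stores the half-open text range `[start, end)` it covers. The `Node` type and both function names are illustrative; re-parsing (step 3) and position updates (step 5) are left out.

```kotlin
data class Node(
    val kind: String,
    val start: Int,
    val end: Int,
    val children: List<Node> = emptyList()
)

// Step 2: the deepest node whose range fully contains the change [changeStart, changeEnd).
fun findEnclosingNode(root: Node, changeStart: Int, changeEnd: Int): Node? {
    if (changeStart < root.start || changeEnd > root.end) return null
    for (child in root.children) {
        findEnclosingNode(child, changeStart, changeEnd)?.let { return it }
    }
    return root
}

// Step 4: rebuilds the path from the root to `old`, sharing all untouched subtrees.
fun replaceNode(root: Node, old: Node, new: Node): Node =
    if (root === old) new
    else root.copy(children = root.children.map { replaceNode(it, old, new) })
```

Because only the nodes on the path to the changed subtree are recreated, the cost of an update depends on the depth of the tree rather than on the size of the file.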

Conclusion

Programming Language Processing in IDE - Updated

						%%{
							init: {
								'theme': 'base',
								'themeVariables': {
									'fontSize': '35px',
									'darkmode': true,
									'lineColor': '#F8B229'
								}
							}
						}%%
						graph LR
						A[Code] --> B[Lexical Analysis]
						B --> C[Tokens]
						C --> D[Syntax Analysis]
						D --> E[AST]
						E --> F[???]
						F --> G[PSI]
					

What to read?

Compilers: Principles, Techniques, and Tools
by Alfred Aho, Jeffrey Ullman, Ravi Sethi

Next: Understanding Programming Language Part 2

In our subsequent lecture, we will delve deep into the world of Symbol Tables. Symbol tables are data structures that store information about the program's various symbols, including variable names, function names, and more.

We will also explore the process of Reference Resolution. This step ensures that every identifier or symbol used in your code corresponds to a valid declaration and that no ambiguities exist.

Questions & Answers

Thank you for your attention!

I'm now open to any questions you might have.