IDE Development Course
Andrew Vasilyev
%%{ init: { 'theme': 'base', 'themeVariables': { 'fontSize': '35px', 'darkmode': true, 'lineColor': '#F8B229' } } }%% graph LR A[Code] --> B[???] B --> C[PSI]
A finite-state automaton (FSA) is a mathematical model of computation. It represents a system with a finite number of states, transitions between those states, a starting state, and one or more accepting (end) states. FSAs are used in parsing, lexical analysis, and many other tasks.
%%{ init: { 'theme': 'base', 'themeVariables': { 'fontSize': '24px', 'darkmode': true, 'lineColor': '#F8B229' } } }%% graph TD A((start)) -->|c| B[c] B -->|h| C[ch] B -->|a| H[ca] C -->|a| D[cha] C -->|e| E[che] D -->|t| G(chat) E -->|a| F[chea] F -->|t| I(cheat) H -->|t| J(cat) G --> Z(end) I --> Z J --> Z
// Kotlin does not allow local enum classes, so State is declared outside the function.
enum class State {
    START, STATE1, STATE2, STATE3, STATE4, STATE5, STATE6, END
}

fun recognize(input: String): Boolean {
    var currentState = State.START
    for (char in input) {
        currentState = when (currentState) {
            State.START -> when (char) {
                'c' -> State.STATE1
                else -> return false
            }
            State.STATE1 -> when (char) {
                'h' -> State.STATE2
                'a' -> State.STATE3
                else -> return false
            }
            State.STATE2 -> when (char) {
                'a' -> State.STATE4
                'e' -> State.STATE5
                else -> return false
            }
            State.STATE3 -> when (char) {
                't' -> State.END
                else -> return false
            }
            State.STATE4 -> when (char) {
                't' -> State.END
                else -> return false
            }
            State.STATE5 -> when (char) {
                'a' -> State.STATE6
                else -> return false
            }
            State.STATE6 -> when (char) {
                't' -> State.END
                else -> return false
            }
            // No transitions leave END, so any extra input is rejected.
            State.END -> return false
        }
    }
    return currentState == State.END
}
fun recognize(input: String): Boolean {
    val chars = input.toCharArray()
    var position = 0

    // Consumes the next character if it equals c.
    fun accept(c: Char): Boolean {
        if (position < chars.size && chars[position] == c) {
            position++
            return true
        }
        return false
    }

    // Local functions must be declared before they are used,
    // so the states are listed from the last ones back to the start.
    fun state3(): Boolean {
        return accept('t')
    }

    fun state4(): Boolean {
        return accept('t')
    }

    fun state6(): Boolean {
        return accept('t')
    }

    fun state5(): Boolean {
        return accept('a') && state6()
    }

    fun state2(): Boolean {
        return when {
            accept('a') -> state4()
            accept('e') -> state5()
            else -> false
        }
    }

    fun state1(): Boolean {
        return when {
            accept('h') -> state2()
            accept('a') -> state3()
            else -> false
        }
    }

    fun stateStart(): Boolean {
        return accept('c') && state1()
    }

    // The whole input must be consumed for the word to be recognized.
    return stateStart() && position == chars.size
}

fun main() {
    println(recognize("chat")) // true
    println(recognize("cheat")) // true
    println(recognize("cat")) // true
}
interface State {
    fun next(c: Char): State
    fun isEnd(): Boolean = false
}

object StartState : State {
    override fun next(c: Char): State = if (c == 'c') State1 else ErrorState
}

object State1 : State {
    override fun next(c: Char): State = when (c) {
        'h' -> State2
        'a' -> State3
        else -> ErrorState
    }
}

object State2 : State {
    override fun next(c: Char): State = when (c) {
        'a' -> State4
        'e' -> State5
        else -> ErrorState
    }
}

object State3 : State {
    override fun next(c: Char): State = if (c == 't') EndState else ErrorState
}

object State4 : State {
    override fun next(c: Char): State = if (c == 't') EndState else ErrorState
}

object State5 : State {
    override fun next(c: Char): State = if (c == 'a') State6 else ErrorState
}

object State6 : State {
    override fun next(c: Char): State = if (c == 't') EndState else ErrorState
}

object EndState : State {
    // Any input after a complete word makes the whole string invalid.
    override fun next(c: Char): State = ErrorState
    override fun isEnd() = true
}

// Dead state: once an unexpected character is seen, the input can no longer be accepted.
// Staying in the current state instead would wrongly accept inputs such as "cxhat".
object ErrorState : State {
    override fun next(c: Char): State = this
}

fun recognize(input: String): Boolean {
    var currentState: State = StartState
    for (char in input) {
        currentState = currentState.next(char)
    }
    return currentState.isEnd()
}
| Current State | 'c' | 'h' | 'a' | 'e' | 't' | any other |
|---------------|------|------|------|------|------|-----------|
| StartState | S1 | - | - | - | - | - |
| S1 | - | S2 | S3 | - | - | - |
| S2 | - | - | S4 | S5 | - | - |
| S3 | - | - | - | - | End | - |
| S4 | - | - | - | - | End | - |
| S5 | - | - | S6 | - | - | - |
| S6 | - | - | - | - | End | - |
| EndState | - | - | - | - | - | - |
enum class State {
    StartState, S1, S2, S3, S4, S5, S6, EndState
}

fun recognize(input: String): Boolean {
    val transitionTable: Map<State, Map<Char, State>> = mapOf(
        State.StartState to mapOf('c' to State.S1),
        State.S1 to mapOf('h' to State.S2, 'a' to State.S3),
        State.S2 to mapOf('a' to State.S4, 'e' to State.S5),
        State.S3 to mapOf('t' to State.EndState),
        State.S4 to mapOf('t' to State.EndState),
        State.S5 to mapOf('a' to State.S6),
        State.S6 to mapOf('t' to State.EndState)
    )
    var currentState = State.StartState
    for (char in input) {
        // A missing entry corresponds to '-' in the table: no transition, so reject.
        currentState = transitionTable[currentState]?.get(char) ?: return false
    }
    return currentState == State.EndState
}
Formal languages provide a structured way to represent and process information.
A formal language is a set of strings over a fixed alphabet of symbols. The rules that determine which strings belong to the language are given by a grammar.
Consider a formal language over the alphabet {0, 1} that consists of all strings where every 0 is immediately followed by a 1. Examples of strings in this language include "1", "11", "101", and "1101". Strings like "10", "100", and "1100" are not in this language.
String - a finite sequence of symbols chosen from a set called an alphabet. For instance, "1101" is a string over the alphabet {0, 1}.
Alphabet - a finite set of symbols. These symbols are the basic building blocks that can be used to construct strings in a formal language. For example, the binary alphabet is {0, 1}.
Symbol - a basic unit in an alphabet. It's an individual character or element that can be used to construct strings. In the English alphabet, "A", "B", and "C" are symbols. In the binary alphabet, 0 and 1 are symbols.
A formal grammar is a set of production rules for strings in a formal language. The grammar describes how to form strings from the language's alphabet that are valid according to the language's syntax. These rules determine how symbols can be combined to produce valid strings.
A grammar is said to be "context-free" if the production rules are applied regardless of the surrounding context of the symbols. This is in contrast to context-sensitive grammars, where the application of rules can depend on the surrounding symbols.
S → 1S | 01S | ε
S is a non-terminal symbol.
1 and 0 are terminal symbols; ε denotes the empty string.
The grammar says that the non-terminal S can be replaced by:
A "1" followed by another string S.
A "01" followed by another string S.
An empty string ε.
With these rules, we can derive strings like "1", "11", "101", "1101", etc., but not strings like "10" or "100".
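As a rough illustration, each alternative of the grammar above can be turned into one branch of a recursive function. The name `derivesFromS` is hypothetical and chosen only for this sketch; it is one possible way to check membership, not a general parsing technique.

```kotlin
// Sketch: checks whether the part of `input` starting at index `from`
// can be derived from S, where S → 1S | 01S | ε.
// The function name and signature are assumptions made for this example.
fun derivesFromS(input: String, from: Int = 0): Boolean = when {
    from == input.length -> true                                    // S → ε
    input.startsWith("1", from) -> derivesFromS(input, from + 1)    // S → 1S
    input.startsWith("01", from) -> derivesFromS(input, from + 2)   // S → 01S
    else -> false                                                   // no rule applies
}
```

For example, `derivesFromS("1101")` follows the rules 1S, 1S, 01S, ε, while `derivesFromS("10")` fails because no rule matches a trailing "0".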
%%{ init: { 'theme': 'base', 'themeVariables': { 'fontSize': '24px', 'darkmode': true, 'lineColor': '#F8B229' } } }%% graph TD A((start)) -->|1| A A -->|0| B B -->|1| A B -->|0| C[error] C -->|1| C C -->|0| C
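The automaton in the diagram above could be coded along these lines. This is a minimal sketch: the names `BitState` and `everyZeroFollowedByOne` are hypothetical, and it assumes that the start state is the only accepting state and that any symbol outside {0, 1} leads to the error state.

```kotlin
// States mirror the diagram: START is the start state, AFTER_ZERO waits for a '1',
// and ERROR is a dead state that can never be left.
enum class BitState { START, AFTER_ZERO, ERROR }

fun everyZeroFollowedByOne(input: String): Boolean {
    var state = BitState.START
    for (c in input) {
        state = when (state) {
            BitState.START -> when (c) {
                '1' -> BitState.START
                '0' -> BitState.AFTER_ZERO
                else -> BitState.ERROR // assumption: unknown symbols are errors
            }
            BitState.AFTER_ZERO -> if (c == '1') BitState.START else BitState.ERROR
            BitState.ERROR -> BitState.ERROR
        }
    }
    // Assumption: only START is accepting, so a trailing '0' is rejected.
    return state == BitState.START
}
```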
T = { a-z, 0-9 }
S → Identifier | Constant
Identifier → Char | Char AlphaNums
Constant → Digits
Digits → Digit | Digit Digits
AlphaNums → CharOrDigit | CharOrDigit AlphaNums
CharOrDigit → Char | Digit
Char → a-z
Digit → 0-9
%%{ init: { 'theme': 'base', 'themeVariables': { 'fontSize': '24px', 'darkmode': true, 'lineColor': '#F8B229' } } }%% graph TD Start((Start)) -->|a-z| IdentifierStart Start -->|0-9| ConstantStart IdentifierStart -->|a-z, 0-9| IdentifierContinue IdentifierContinue -->|a-z, 0-9| IdentifierContinue ConstantStart -->|0-9| ConstantContinue ConstantContinue -->|0-9| ConstantContinue
Tokens are sequences of characters with a collective meaning. Examples include keywords, identifiers, operators, and literals.
Lexical analysis is the process of translating an input stream of characters into a stream of meaningful tokens.
A lexical analyzer (or lexer) is an automaton that recognizes tokens in a stream of data and produces a stream of meaningful tokens.
To implement a lexer:
- Define a grammar that describes the tokens
- Build an FSA that takes a stream of characters as input and outputs a stream of tokens
sealed class Token
data class IdentifierToken(val name: String) : Token()
data class ConstantToken(val value: Int) : Token()

class Lexer(private val input: String) {
    private var currentIndex = 0

    fun tokenize(): List<Token> {
        val tokens = mutableListOf<Token>()
        skipWhitespace()
        while (currentIndex < input.length) {
            when (val char = peek()) {
                in '0'..'9' -> tokens.add(constant())
                in 'a'..'z' -> tokens.add(identifier())
                else -> throw IllegalArgumentException("Unexpected character '$char' at position $currentIndex.")
            }
            skipWhitespace()
        }
        return tokens
    }

    // '\u0000' is returned at the end of input.
    private fun peek(): Char = if (currentIndex < input.length) input[currentIndex] else '\u0000'

    private fun nextChar(): Char = if (currentIndex < input.length) input[currentIndex++] else '\u0000'

    private fun skipWhitespace() {
        while (currentIndex < input.length && peek().isWhitespace()) {
            currentIndex++
        }
    }

    // Constant: one or more digits 0-9.
    private fun constant(): ConstantToken {
        val sb = StringBuilder()
        while (currentIndex < input.length && peek() in '0'..'9') {
            sb.append(nextChar())
        }
        return ConstantToken(sb.toString().toInt())
    }

    // Identifier: a letter a-z followed by any number of letters a-z or digits 0-9.
    private fun identifier(): IdentifierToken {
        val sb = StringBuilder()
        while (currentIndex < input.length && (peek() in '0'..'9' || peek() in 'a'..'z')) {
            sb.append(nextChar())
        }
        return IdentifierToken(sb.toString())
    }
}
A parse tree, also known as a concrete syntax tree, is a tree representation that reflects the syntactic structure of a string according to some formal grammar.
E → E + E
E → E * E
E → ( E )
E → number
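Which rule applies at a given point depends on the surrounding symbols (for example, every "(" needs a matching ")"), so the language is not regular. The grammar itself is context-free, and as written it is also ambiguous.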
%%{ init: { 'theme': 'base', 'themeVariables': { 'fontSize': '24px', 'darkmode': true, 'lineColor': '#F8B229' } } }%% graph TD E1("(E)") E2("(E)") PLUS["+"] E3("(E)") E4("(E)") TIMES[("*")] E5("(E)") N1["3"] N2["2"] N3["4"] E1 --> PLUS E1 --> E2 E1 --> E3 E3 --> TIMES E3 --> E4 E3 --> E5 E2 --> N1 E4 --> N2 E5 --> N3
An Abstract Syntax Tree (AST) is a hierarchical tree representation of the abstract syntactic structure of a string in a formal language. Each node of the tree denotes a construct occurring in the string.
Unlike a parse tree, an AST abstracts away many of the specific syntactic details, such as parentheses.
ASTs are used in the syntax analysis phase of compilers to simplify the structure of the source code, omitting certain syntactic elements to focus more on the semantic structure of the code.
E → E + E
E → E * E
E → ( E )
E → number
%%{ init: { 'theme': 'base', 'themeVariables': { 'fontSize': '24px', 'darkmode': true, 'lineColor': '#F8B229' } } }%% graph TD A[+] B[3] C[*] D[2] E[5] A --> B A --> C C --> D C --> E
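To make the tree in the diagram above more concrete, here is a small sketch of how such an AST might be represented and evaluated. The types `Expr`, `Num`, and `BinOp` and the function `evaluate` are hypothetical names introduced for this example only.

```kotlin
// Hypothetical AST node types: a number leaf and a binary operation.
sealed class Expr
data class Num(val value: Int) : Expr()
data class BinOp(val op: Char, val left: Expr, val right: Expr) : Expr()

// Evaluates the tree bottom-up: children first, then the operator at the node.
fun evaluate(e: Expr): Int = when (e) {
    is Num -> e.value
    is BinOp -> {
        val l = evaluate(e.left)
        val r = evaluate(e.right)
        when (e.op) {
            '+' -> l + r
            '*' -> l * r
            else -> throw IllegalArgumentException("Unknown operator '${e.op}'")
        }
    }
}

// The AST from the diagram: '+' at the root, with 3 and the subtree (2 * 5) as children.
val sampleAst: Expr = BinOp('+', Num(3), BinOp('*', Num(2), Num(5)))
```

The tree contains no node for parentheses or for the grammar symbol E; operator precedence is expressed purely by the shape of the tree, so `evaluate(sampleAst)` multiplies before it adds.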
A basic FSA has no memory of context, so it cannot select the right transition.
Add a stack to the FSA and use it to traverse the AST while building it.
An FSA (Finite State Automaton) augmented with a stack is called a Pushdown Automaton (PDA). The stack gives the automaton memory, allowing it to recognize context-free languages, which are a superset of regular languages. The stack can be pushed to, popped from, or inspected, and these operations can influence the transitions between states in the PDA.
The name "pushdown" comes from the way the stack is used: data is "pushed down" onto the stack and later "popped up" from the stack. The presence of the stack allows a PDA to remember an unbounded amount of information, which is crucial for recognizing languages that have nested or recursive structures.
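A classic case of a nested structure is balanced brackets, which no FSA can recognize for arbitrary depth. The following is a minimal sketch of a pushdown-style recognizer; the function name `isBalanced` and the choice of bracket pairs are assumptions made for illustration.

```kotlin
// Sketch of a pushdown-style recognizer: a single loop over the input plus an explicit stack.
fun isBalanced(input: String): Boolean {
    val stack = ArrayDeque<Char>()
    val closingFor = mapOf('(' to ')', '[' to ']')
    for (c in input) {
        when (c) {
            // Push the closer that must eventually match this opener.
            '(', '[' -> stack.addLast(closingFor.getValue(c))
            // Pop and compare; an empty stack or a different closer means a mismatch.
            ')', ']' -> if (stack.removeLastOrNull() != c) return false
            else -> { /* other characters do not affect the stack */ }
        }
    }
    // Every opener must have been closed by the end of the input.
    return stack.isEmpty()
}
```

The stack depth grows with the nesting depth of the input, which is exactly the unbounded memory that a finite set of states cannot provide.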
A syntax analyzer, commonly known as a parser, is a component in the compilation process responsible for taking a stream of tokens (typically produced by a lexical analyzer) and determining if they form a syntactically correct sequence. It also often constructs a representation of the structure of that sequence, commonly in the form of a parse tree or an abstract syntax tree (AST).
Top-down parsing starts from the root (the start symbol) and constructs the parse tree down to the leaves (the tokens). It starts with the start symbol and tries to rewrite it to match the input. In the process, it predicts which production to use and then matches the production against the input.
Recursive Descent Parsing is a kind of top-down parsing where each non-terminal in the grammar is implemented as a function. The parser tries to apply the productions to predict and match the input.
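// Token and AST types used by the parser. They are not shown on the slide,
// so these definitions are assumptions that match how the parser uses them.
sealed class Token
data class NumberToken(val value: Int) : Token()
data class OpToken(val op: Char) : Token()
object LeftParenToken : Token()
object RightParenToken : Token()

sealed class ExprNode
data class NumberNode(val value: Int) : ExprNode()
data class BinOpNode(val left: ExprNode, val op: Char, val right: ExprNode) : ExprNode()

class RecursiveDescentParser(private val tokens: List<Token>) {
    private var currentTokenIndex = 0

    fun parse(): ExprNode {
        val result = expression()
        expectEndOfInput()
        return result
    }

    // expression → term ('+' term)*
    private fun expression(): ExprNode {
        var node = term()
        while (peekOp() == '+') {
            val op = nextOp()
            val rightNode = term()
            node = BinOpNode(node, op, rightNode)
        }
        return node
    }

    // term → factor ('*' factor)*
    // Parsed one level below expression(), so '*' binds tighter than '+'.
    private fun term(): ExprNode {
        var node = factor()
        while (peekOp() == '*') {
            val op = nextOp()
            val rightNode = factor()
            node = BinOpNode(node, op, rightNode)
        }
        return node
    }

    // factor → '(' expression ')' | number
    private fun factor(): ExprNode {
        return when (val token = nextToken()) {
            is LeftParenToken -> {
                val node = expression()
                consumeToken(RightParenToken)
                node
            }
            is NumberToken -> NumberNode(token.value)
            else -> throw IllegalArgumentException("Unexpected token: $token")
        }
    }

    // Returns null at the end of input instead of indexing past the token list.
    private fun peekToken(): Token? = tokens.getOrNull(currentTokenIndex)

    private fun nextToken(): Token =
        tokens.getOrNull(currentTokenIndex++) ?: throw IllegalArgumentException("Unexpected end of input.")

    private fun peekOp(): Char? = (peekToken() as? OpToken)?.op

    private fun nextOp(): Char = (nextToken() as OpToken).op

    private fun consumeToken(expected: Token) {
        if (peekToken() == expected) {
            currentTokenIndex++
        } else {
            throw IllegalArgumentException("Expected token: $expected, but found: ${peekToken()}.")
        }
    }

    private fun expectEndOfInput() {
        if (currentTokenIndex != tokens.size) {
            throw IllegalArgumentException("Unexpected token at position $currentTokenIndex: ${tokens[currentTokenIndex]}")
        }
    }
}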
If a lexer encounters an unidentifiable sequence, it's termed a 'lexical error'. While compilers might halt or flag an error at this point, an IDE needs to gracefully handle the text that follows.
A user-friendly solution could also involve generating a specific 'Lexer Error' token, which can then be highlighted in the editor, giving developers a clear visual cue of the issue.
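One way this could look in code is sketched below. Instead of throwing on an unexpected character, the lexer wraps the unrecognized run in an error token and continues. The names `Tok`, `WordTok`, `NumberTok`, `ErrorTok`, and `tolerantTokenize` are hypothetical, and the token set (lowercase words and digit sequences) is an assumption chosen to keep the example short.

```kotlin
// Hypothetical token types; ErrorTok carries the offending text and its offset
// so that an editor could highlight exactly that range.
sealed class Tok
data class WordTok(val text: String) : Tok()
data class NumberTok(val text: String) : Tok()
data class ErrorTok(val text: String, val offset: Int) : Tok()

fun tolerantTokenize(input: String): List<Tok> {
    val tokens = mutableListOf<Tok>()
    var i = 0
    while (i < input.length) {
        val start = i
        val c = input[i]
        when {
            c.isWhitespace() -> i++
            c in 'a'..'z' -> {
                while (i < input.length && (input[i] in 'a'..'z' || input[i] in '0'..'9')) i++
                tokens.add(WordTok(input.substring(start, i)))
            }
            c in '0'..'9' -> {
                while (i < input.length && input[i] in '0'..'9') i++
                tokens.add(NumberTok(input.substring(start, i)))
            }
            else -> {
                // Group the whole unrecognized run into one error token and keep lexing.
                while (i < input.length && !input[i].isWhitespace() &&
                    input[i] !in 'a'..'z' && input[i] !in '0'..'9'
                ) i++
                tokens.add(ErrorTok(input.substring(start, i), start))
            }
        }
    }
    return tokens
}
```

Because the error is just another token, everything after it is still tokenized normally, and later stages can decide how to display or skip it.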
Incremental lexing involves analyzing only the parts of a code file that have changed, rather than the entire file.
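A deliberately simplified sketch of the idea follows. It assumes that no token ever spans more than one line (which does not hold for multi-line comments or strings in real languages), so an edit to one line only invalidates that line's tokens. The class `IncrementalLineLexer` and its members are hypothetical, and tokens are plain whitespace-separated strings.

```kotlin
// Hypothetical line-based cache: each line is tokenized independently,
// so replacing a line re-lexes only that line.
class IncrementalLineLexer(text: String) {
    private val lines = text.split("\n").toMutableList()
    private val cache = lines.map { lexLine(it) }.toMutableList()

    // Number of lines re-lexed after construction, exposed to show how little work an edit costs.
    var relexCount = 0
        private set

    fun tokens(): List<List<String>> = cache

    fun replaceLine(index: Int, newText: String) {
        lines[index] = newText
        cache[index] = lexLine(newText) // all other lines keep their cached tokens
        relexCount++
    }

    // Stand-in lexer: splits a line on spaces and tabs.
    private fun lexLine(line: String): List<String> =
        line.split(" ", "\t").filter { it.isNotEmpty() }
}
```

A production lexer would typically also store the lexer state at the start of each line or token and keep re-lexing until that state matches the cached one again.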
When an error is detected, skip tokens until a specific "synchronizing" token is found. This is often a token that marks the end of a statement (e.g., a semicolon in many languages or a newline in Python).
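The skipping step might be sketched as follows, assuming a toy language whose only statement has the form `name = value ;` and whose tokens are plain strings. The names `ParseResult`, `isOperand`, and `parseWithPanicMode` are hypothetical and exist only for this illustration.

```kotlin
data class ParseResult(val statements: List<String>, val errors: List<String>)

// Anything except '=' and ';' counts as a name or value in this toy language.
fun isOperand(token: String) = token != "=" && token != ";"

fun parseWithPanicMode(tokens: List<String>): ParseResult {
    val statements = mutableListOf<String>()
    val errors = mutableListOf<String>()
    var i = 0
    while (i < tokens.size) {
        val wellFormed = i + 3 < tokens.size &&
            isOperand(tokens[i]) && tokens[i + 1] == "=" &&
            isOperand(tokens[i + 2]) && tokens[i + 3] == ";"
        if (wellFormed) {
            statements.add("${tokens[i]} = ${tokens[i + 2]}")
            i += 4
        } else {
            errors.add("Syntax error in statement starting at token $i")
            // Skip ahead to the synchronizing token ';' ...
            while (i < tokens.size && tokens[i] != ";") i++
            // ... and step over it so parsing resumes at the next statement.
            i++
        }
    }
    return ParseResult(statements, errors)
}
```

With the input `a = 1 ; b = = 2 ; c = 3 ;` the malformed middle statement produces one error, while the statements before and after it are still parsed.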
For some grammar rules, we can predict the likely ending (like a semicolon for an assignment). If an unexpected token is encountered, insert the predicted one. However, this may lead to additional syntax errors that don't exist in the actual code.
If certain common errors are known, they can be represented in the grammar rules. This allows the parser to produce AST nodes for these errors, and later during analysis, the IDE can suggest fixes for them.
In an Integrated Development Environment (IDE), it's not feasible to rebuild the entire Abstract Syntax Tree (AST) for every minor code change. The solution is to rebuild only the modified part of the tree.
%%{ init: { 'theme': 'base', 'themeVariables': { 'fontSize': '35px', 'darkmode': true, 'lineColor': '#F8B229' } } }%% graph LR A[Code] --> B[Lexical Analysis] B --> C[Tokens] C --> D[Syntax Analysis] D --> E[AST] E --> F[???] F --> G[PSI]
Compilers: Principles, Techniques, and Tools
by Alfred Aho, Jeffrey Ullman, Ravi Sethi
In our subsequent lecture, we will delve deep into the world of Symbol Tables. Symbol tables are data structures that store information about the program's various symbols, including variable names, function names, and more.
We will also explore the process of References Resolving. This step ensures that every identifier or symbol used in your code corresponds to a valid declaration and that no ambiguities exist.
Thank you for your attention!
I'm now open to any questions you might have.