1 | This blog is from minko_gechev, and in following content, the "I" refers to minko_gechev. And this blog inspire me in a very impressive way. |
I already wrote a couple of essays related to the development of programming languages that I was extremely excited about! For instance, in “Static Code Analysis of Angular 2 and TypeScript Projects“[1] I explored the basics of the front end of the compilers, explaining the phases of lexical analysis, syntax analysis and abstract-syntax trees.
Recently I published “Developing Statically Typed Programming Language“[2]. This post shown a simple, statically typed, functional programming language inspired by lambda calculus. There I outsourced the front end part of the compiler development to a parser generator and focused on the back end in the faces of a module for type checking and one for code generation.
Why do I need this?
You might be now wondering “Why would I need to know how to develop a compiler?“. There are a few important reasons:
- You will get a better understanding of how the programming languages you’re using work. This will allow you to develop more efficient programs with them.
- Often, you’ll have to reuse pieces of the modules described below for different purposes (for instance, parsing of configuration files, parsing of network messages, etc.).
- Creating a DSL. Creating a Domain Specific Language in your project could be quite handy in order to simplify tasks which otherwise take a lot of time to solve with a general purpose programming language.
What are we going to cover?
In this blog post we’ll cover the basics from end-to-end! We’ll develop an extremely simple compiler on 25 lines of JavaScript! Our compiler will have:
- Module for lexical analysis
- Module for syntax analysis
- The parser will be based on an EBNF grammar
- We will develop the parser by using a recursive descent parsing algorithm
- Code generator
The language that we’re going to explore is not particularly useful for developing meaningful software programs but it can be easily extended to one.
The entire implementation can be found at my GitHub profile[3].
Introducing a Simple Prefix Language
Here’s how a sample expression in our language is going to look like:
1 | mul 3 sub 2 sum 1 3 4 |
By the end of this article we’ll be able to transpile these expressions to JavaScript by going through phases typical for any compiler!
For simplicity, there are a few rules we need to follow:
- We have only the functions:
mul
,sub
,sum
,div
. - Each individual string token is surrounded by whitespace.
- We support only natural numbers.
In order to explore the semantics behind the expression above lets define a few JavaScript functions:
1 | const mul = (...operands) => operands.reduce((a, c) => a * c, 1); |
mul
accepts multiple operands, passed with the spread operator. The function just multiplies all of them, so for instance mul(2, 3, 4) === 24
. sub
respectively subtracts the passed arguments and sum
sums them.
The expression above can be translated to the following JavaScript expression:
1 | mul(3, sub(2, sum(1, 3, 4))) |
or…
1 | 3 * (2 - (1 + 3 + 4)) |
Now, after we have understanding of the semantics, lets start with the front end of the compiler!
Note: Similar prefix expressions can be simply evaluated with a stack-based algorithm, however, in this case we’ll focus on concepts rather than implementation.
Developing the Compiler’s Front End
The front end of any compiler usually has the modules for Lexical Analysis[4] and Syntax Analysis[5]. In this section we’ll build both modules in a few lines of JavaScript!
Developing a Lexical Analyzer
The phase of lexical analysis is responsible for dividing the input string (or stream of characters) of the program into smaller pieces called tokens. The tokens usually carries information about their type (if they are numbers, operators, keywords, identifiers, etc), the substring of the program they represent and their position in the program. The position is usually used for reporting user friendly errors in case of invalid syntactical constructs.
For instance, if we have the JavaScript program:
1 | if (foo) { |
A sample lexical analyzer for JavaScript will produce the output:
1 | [ |
We’ll keep our lexical analyzer as simple as possible. In fact, we’ll implement it on a single line of JavaScript:
1 | const lex = str => str.split(' ').map(s => s.trim()).filter(s => s.length); |
Here we split the string by a single space, we map the produced substrings to their trimmed version and filter the empty strings.
Invoking the lexer with an expression will produce an array of strings:
1 | lex('mul 3 sub 2 sum 1 3 4') |
This is completely enough for our purpose!
Now lets go to the phase of syntax analysis!
Developing a Parser
The syntax analyzer (often know as parser) is the module of a compiler which out of a list (or stream) of tokens produces an Abstract Syntax Tree[6] (or in short an AST). Along the process, the syntax analyzer may also produce syntax errors in case of invalid programs.
Usually, the parser is implemented based on a grammar[7]. Here’s the grammar of our language:
1 | digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
This basically means that we have digits which composed together can form numbers (num
). We have 4 operations and an expression can be either a num
ber, or op
eration, followed by one or more expr
essions. We’ll refer to the individual definitions in the grammar (for instance to num
and op
) as rules.
This is the so called EBNF grammar[6]. Look at the grammar for a bit, try to understand it, and after that completely forget about it! We’ll come back to the grammar after we explain the parser and you’ll see how everything connects together!
As we mentioned, a parser is a tool which turns a list of tokens to an AST.
For instance, for our expression:
1 | mul 3 sub 2 sum 1 3 4 |
The parser will produce the following AST, based on the grammar above:
Lets explore the algorithm for this!
1 | const Op = Symbol('op'); |
The bad news is that there are a lot of things going on. The good news is that this is the most complicated part of the compiler!
Lets divide the code into parts and look into each one step by step.
Node Types
1 | const Op = Symbol('op'); |
First we define the different node types that we are going to have in the AST. We’ll have only numbers and operations. For example, the number node 42
will look like this:
1 | { |
The operator sum
, applied to 2
, 3
, 4
will look like this:
1 | { |
That’s how simple is that!
The Parser
After we declare the node types, we define a function called parse
which accepts a single argument called tokens
. Inside of it we define five more functions:
peek
- returns the element oftokens
associated with the current value of thec
local variable.consume
- returns the element oftokens
associated with the current value of thec
local variable and incrementsc
.parseNum
- gets the current token (i.e. invokespeek()
), parses it to a natural number and returns a new number token.parseOp
- we’ll explore in a little bit.parseExpr
- checks if the current token matches the regular expression/\d/
(i.e. is a number) and invokesparseNum
if the match was successful, otherwise returnsparseOp
.
Parsing Operations
The parseOp
is maybe the most complicated function from the parser above. That’s the case because of the loop and the indirect recursion that we have. Here’s its definition one more time in order to explain it line by line:
1 | const parseOp = () => { |
Since parseOp
has been invoked by parseExpr
when the value of peek()
is not a number we know that it is an operator so we create a new operation node. Note that we don’t perform any further validation, however, in a real-world programming language we’d want to do that and eventually throw a syntax error in case of unexpected token.
Anyhow, in the node declaration we set the list of “sub-expressions” to be the empty list (i.e. []
), the operation name to the value of consume()
and the type of the node to Op
. Later, while we don’t reach the end of the program, we loop over all tokens by pushing the currently parsed expression to the list of “sub-expressions` of the given node. Finally, we return the node.
Keep in mind that while (peek()) node.expr.push(parseExpr());
performs an indirect recursion. In case we have the expression:
1 | sum sum 2 |
This will
- First, invoke
parseExpr
, which will find that the current token (i.e.tokens[0]
) is not a number (it’ssum
) so it’ll invokeparseOp
. - After that
parseOp
will create the operation node and because of theconsume()
call, increment the value ofc
. - Next
parseOp
will iterate over the tokens, and fortokens[c]
, wherec
now equals1
will invokeparseExpr
. parseExpr
will find that the current token is not a number so it’ll invokeparseOp
.parseOp
will create another operation node and incrementc
and will start looping over the remaining tokens again.parseOp
will invokeparseExpr
wherec
will now equal2
.- Since
tokens[2] === "2"
,parseExpr
will invokeparseNum
which will create a number node, incrementing thec
variable. parseNum
will return the number node and it will be pushed into theexpr
array of the last operation node produced by the latestparseOp
invocation.- The last
parseOp
invocation will return the operation node sincepeek()
will returnundefined
(parseNum
has incrementedc
to3
andtokens[3] === undefined
). - The node returned by the last invocation of
parseOp
will be returned to the outermost invocation ofparseOp
which will return its operation node as well. - Finally,
parseExpr
will return the root operation node.
The produced AST will look like:
1 | { |
…and that’s it! The final step is to traverse this tree and produce JavaScript!
Recursive Descent Parsing
Now lets see how are the individual functions related to the grammar we defined above and see why having a grammar makes sense in general. Lets take a look at the rules in the EBNF grammar:
1 | digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
Do they now may make a bit more sense? expr
looks very much like parseExpr
, where we parse either a num
ber or an op
eration. Similarly, op expr+
looks very much like parseOp
and num
like parseNum
. In fact, very often parsers are generated directly from the grammars since there’s a direct connection between both with the recursive descent parsing algorithm[8].
And in fact, we just developed a simple recursive descent parser! Our parser was quite simple (well, we have only 4 production rules in the grammar) but you can imagine how complex the parser of a real-life programming language is.
It’s extremely convenient to develop the grammar of a language before writing the actual parser in order to observe a simplified model of it. The parser contains a lot of details (for instance a lot of syntax constructs of the language you’re developing it with), in contrast to the grammar which is extremely simplified and minimalistic.
Developing the Transpiler
In this part of the post we’ll traverse the AST of the language and produce JavaScript. The entire transpiler is on 7 lines of JavaScript (literally!):
1 | const transpile = ast => { |
Lets explore the implementation line by line.
First we define a function called transpile
. It accepts as argument the AST produced by the parser. After that in the opMap
we define the mapping between arithmetic operations and the operators in the language. Basically, sum
maps to +
, mul
to *
, etc.
As next step, we define the function transpileNode
which accepts an AST node. If the node is a number, we invoke the transpileNum
function with the given node, otherwise, we invoke transpileOp
.
Finally, we define the two functions for transpilation of the individual nodes:
transpileNum
- translates a number to a JavaScript number (simply by returning it).transpileOp
- translates an operation to a JavaScript arithmetic operation. Notice that we have indirect recursion here (transpileOp -> transpileNode -> transpileOp
). For each operation node, we want to transpile its sub-expressions first. We do that by invoking thetranspileNode
function.
On the last line of transpile
‘s body, we return the result of transpileNode
applied to the root of the tree.
Wiring Everything Together
Here’s how we can wire everything together:
1 | const program = 'mul 3 sub 2 sum 1 3 4'; |
We invoke lex(program)
, which produces the list of tokens, after that we pass the tokens to the parse
function, which produces the AST and finally, we transpile
the AST to JavaScript!
Conclusion
This article explained in details the development of a very simple compiler (or transpiler) of a language with prefix expressions to JavaScript. Although this was explanation of only the very basics of the compiler development we were able to cover few very important concepts:
- Lexical analysis
- Syntax analysis
- Source code generation
- EBNF grammars
- Recursive Descent Parsing
If you’re interested in further reading, I’d recommend:
- Developing Statically Typed Programming Language
- Let’s Build A Simple Interpreter
- Compilers: Principles, Techniques, and Tools (2nd Edition) 2nd Edition
- Types and Programming Languages
References
- Static Code Analysis of Angular 2 and TypeScript Projects - https://mgv.io/ng-static-analysis.
- Developing Statically Typed Programming Language - https://mgv.io/typed-lambda
- Tiny Compiler - https://mgv.io/tiny-compiler
- Lexical Analysis - https://mgv.io/wiki-lex
- Syntax Analysis - https://mgv.io/wiki-parser
- Abstract Syntax Tree - https://mgv.io/wiki-case-ast
- EBNF grammar - https://mgv.io/wiki-ebnf
- Recursive Descent Parsing - https://mgv.io/wiki-recursive-descent-parser