At EqualTo we built a spreadsheet engine from scratch in the Rust programming language. We can compile it to WebAssembly and run it in the browser or compile it to machine code and run it headless on the server. The first step in writing such an engine is to write an Excel formula parser. Since writing a full-fledged Excel formula parser is a daunting task we will split across multiple articles. In this first installment today we will show you how to build a parser and interpreter for a desk calculator.
Parsing serves as the front end of a compiler and it is considered a solved problem today. Its primary purpose is to transform a human readable string into an object, the abstract syntax tree, that a computer can understand and process.
If you are creating a new programming language you will need to create a parser.
In general, there are two main approaches to parsing: you can either use a parser generator (also known as a "compiler compiler") or create a handmade parser from scratch. Times have changed, and while 30 years ago the popular choice was the parser generator, today most folks would create a handmade parser from scratch.
Parser generators typically follow a bottom-up, non-recursive algorithm. They usually run in linear time and linear space, and are table-driven. The classic example is Bison. There are many others, such as the famous ANTLR, which actually uses a top-down recursive algorithm. With these tools, you only need to specify the grammar for your language, and they will generate the code for your parser.
There are some disadvantages to parser generators:
Handmade parsers also have their drawbacks, as they're more susceptible than parser generators to errors and performance issues. Errors are more likely to occur in your code than in your grammar, although you can mitigate risks with thorough testing. On the other hand, it is easier to handle unusual corner cases in your language with a handwritten parser, which offers greater flexibility. Moreover, error reporting and recovery are much simpler with handwritten parsers.
In general, if you are writing a "small" language, I prefer a handmade parser. However, if you are working with a complex language, have time constraints, and have multiple contributors to the project, a parser generator might be the better option.
One final thing about handmade parsers... they are a lot of fun! :)
In this next section we will build a calculator using a Pratt parser. There is a fair amount of theory involved, so if some aspects are unclear upon first reading, continue reading and return to those parts later for clarification.
What I cannot create, I do not understand
- Richard Feynman
In their now almost 40 year old book The Unix Programming Environment Brian Kernighan and Rob Pike urge us to build a desk calculator. They used the standard techniques at the time, a parser generator yacc (predecessor to Bison) and lex (today you could use flex).
We are going to follow in their footsteps, and create a handmade parser for a desk calculator.
You can find all the code for this example in our GitHub repository.
We are going to split the evaluate function in two parts:
When writing a parser the first thing you need to do is to spell out the grammar for your language:
These are the blueprints of the program we are going to write. There are formal, very theoretical specifications of what a grammar is, but we will be a bit loose here.
Once your grammar is defined, you should create your data structures. Rust is very convenient for this, due to the availability of algebraic data types and enum varieties.
The list of tokens will be an enum:
The algorithm that performs this task is a Deterministic Finite Automaton (DFA). Essentially, if your tokens can be described by Regular Expressions (RE), they can be converted to a DFA and easily implemented in code (as per Thompson's construction). In this article, we will not delve into the theorem itself; instead, we will focus on how we can apply it to our specific example.
Outline of the desk calculator lexer:
Let's review. First, observe the Lexer definition:
If we were to make the same structure in TypeScript, Python, Kotlin, ... the definition would be obvious:
But in Rust we have many things to consider. Do we want the data to be owned by the Lexer, or is the data shared? The first option is easier but implies that the Lexer will make a copy of the data and do memory allocations on the heap. The second is a bit more complex because it involves lifetimes, but it's faster and more memory efficient.
Most of the code in the lexer should be relatively straightforward to read, with the possible exception of the "number recognizer".
The second step is the more difficult of the two.
The inverse problem of finding the most visually appealing way to display a given floating-point number currently relies on the Ryū algorithm by Ulf Adams, which is one of my favorite modern algorithms.
A RE for a number might be:
Our number recognizer, a finite automata, has 8 states. An "accepting state" is a state in which you can finish your parsing. It's pretty easy to transform the above diagram into code:
To iterate is human, to recurse divine.
- Peter Deutsch
Once we are able to lex tokens we are ready to build the AST. We will use a top down recursive parser.
The data structure is something like:
Traditionally, grammar is adapted to reflect the precedence rules so as to produce the desired parse tree. But we will follow a different strategy: "top down precedence parsing" was invented by Vaughan Pratt and popularized in his build of jslint by Douglas Crockfort and Bob Nystrom of Game programming patterns and Crafting interpreters fame, and must be one of the most beautiful recursive parsing algorithms ever written.
Here we are inspired by the work of Alex Kladov in implementing a Pratt parser.
The basic implementation of a top down recursive Pratt parser will therefore resemble:
You will probably remember from your school days that every recursive algorithm has an equivalent iterative algorithm. In this particular case the iterative algorithm is called shunting yard algorithm and was invented by Edsger Dijkstra 1961 for the Algol programming language.
Although not the focus for this post, a calculator would be incomplete without an evaluator. It is the soul of a tree walking interpreter.
Once we have the AST, evaluation is pretty straightforward: