Compilation

An efficient implementation of any language requires compiling it from a textual form into some kind of easily (and quickly) executable form. MiniD is no different.

Note that this part of the spec is not necessarily a binding contract for all implementations of MiniD. Other compilers may compile the source in a different way, using a single-pass compiler, adding layers of optimization, or even compiling it into a completely different format. There is also nothing precluding MiniD from being a natively compiled language.

Phases of Compilation

The MiniD compiler conceptually has several phases of compilation, but in the interest of speed and lower memory consumption, some of them have been combined.

Here are the phases:

  1. Lexical Analysis - The source code is read from an input stream and split into tokens. Illegal characters and malformed tokens are found and rejected in this phase.
  2. Syntactic Analysis - The token stream is parsed to form syntax trees. This phase determines the overall structure of the program and ensures that consecutive tokens make sense and are not just gibberish.
  3. Semantic Analysis - The syntax tree of the program is checked for consistency and validity. This phase checks references to local variables and ensures proper use of some constructs. At the same time, some more complex constructs are rewritten in terms of simpler ones or into forms internal to the compiler, and constant folding is performed.
  4. Code Generation - The semantically-analyzed tree is again traversed to generate a data structure that contains bytecode which can be run by the interpreter.
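
As an illustration of the constant folding mentioned in phase 3, here is a minimal sketch in Python. The node classes (Num, Var, BinOp) and the fold function are hypothetical names made up for this example; they are not the reference compiler's actual AST types, and the real pass handles many more node kinds.

```python
from dataclasses import dataclass
from typing import Union


# Hypothetical AST node types, invented for this sketch.
@dataclass
class Num:
    value: int


@dataclass
class Var:
    name: str


@dataclass
class BinOp:
    op: str
    left: "Expr"
    right: "Expr"


Expr = Union[Num, Var, BinOp]

# Operators this sketch knows how to evaluate at compile time.
_OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}


def fold(node: Expr) -> Expr:
    """Return a tree in which constant subexpressions are pre-computed."""
    if isinstance(node, BinOp):
        # Fold the children first so constants propagate upward.
        left = fold(node.left)
        right = fold(node.right)
        if isinstance(left, Num) and isinstance(right, Num) and node.op in _OPS:
            return Num(_OPS[node.op](left.value, right.value))
        return BinOp(node.op, left, right)
    # Leaves (numbers and variable references) are already as simple as possible.
    return node


# x + 2 * 3: only the right-hand side is constant.
tree = BinOp("+", Var("x"), BinOp("*", Num(2), Num(3)))
folded = fold(tree)
```

Here the subexpression 2 * 3 is replaced by a single constant node, while the reference to x is left for code generation to resolve.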

These are the conceptual phases. In the reference implementation, the compiler combines phases 1 and 2 into a single "parsing" pass, phase 3 consists only of constant folding and rewriting, and phase 4 performs variable lookup and code generation. Phases 1 and 2 are combined because the lexical analysis is partially dependent upon the syntactic analysis, at least as far as whether or not newlines are significant (i.e. whether they end a statement or are just whitespace). There are other ways of implementing this, but this method was also chosen because it eliminates the need to allocate memory for every token in the source. Variable references are checked during code generation for simplicity and to reduce the need to allocate memory.
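
The coupling between the lexer and the parser can be sketched as follows. This is a simplified Python illustration, not the reference implementation: the Lexer class, the split_statements function, and the rule that newlines are ignored inside parentheses are all assumptions made for the example. The point is only that the parser pulls one token at a time and tells the lexer, on each request, whether a newline should count.

```python
import re
from typing import List, Tuple

Token = Tuple[str, str]  # (kind, text)

# Toy token grammar, invented for this sketch.
_TOKEN_RE = re.compile(
    r"(?P<nl>\n)|(?P<ws>[ \t\r]+)|(?P<num>\d+)"
    r"|(?P<id>[A-Za-z_]\w*)|(?P<op>[-+*/=()])"
)


class Lexer:
    """Produces tokens on demand, so no token list is ever built."""

    def __init__(self, source: str) -> None:
        self.source = source
        self.pos = 0

    def next(self, newlines_significant: bool) -> Token:
        while self.pos < len(self.source):
            m = _TOKEN_RE.match(self.source, self.pos)
            if m is None:
                raise SyntaxError(f"illegal character {self.source[self.pos]!r}")
            self.pos = m.end()
            kind = m.lastgroup
            # Whitespace is always skipped; newlines only when the parser says so.
            if kind == "ws" or (kind == "nl" and not newlines_significant):
                continue
            return (kind, m.group())
        return ("eof", "")


def split_statements(source: str) -> List[List[str]]:
    """Toy 'parser': groups token texts into newline-terminated statements."""
    lexer = Lexer(source)
    statements: List[List[str]] = []
    current: List[str] = []
    depth = 0  # parenthesis nesting; assumed rule: newlines inside () are whitespace
    while True:
        kind, text = lexer.next(newlines_significant=(depth == 0))
        if kind == "eof":
            break
        if kind == "nl":
            if current:
                statements.append(current)
                current = []
            continue
        if text == "(":
            depth += 1
        elif text == ")":
            depth = max(depth - 1, 0)
        current.append(text)
    if current:
        statements.append(current)
    return statements


statements = split_statements("a = (1 +\n 2)\nb = 3\n")
```

In this sketch the newline after the + is treated as whitespace because the parser is inside a parenthesized expression, while the newlines after ) and 3 end their statements.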

The compiler allows you to "intercept" the compilation between the parsing and codegen phases, giving you access to the abstract syntax tree (AST) of the code. This can be used to analyze the code for things like lint tools or integration into IDEs. The AST can also be manipulated, or an entirely new AST can be created from scratch. Finally, the AST can be passed to code generation, which outputs the aforementioned bytecode.
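
The shape of this workflow (parse, inspect or rewrite the tree, then generate code) can be shown by analogy with Python's standard ast module, which exposes a similar seam in its own compiler. This is only an analogy: none of the calls below are part of MiniD's API, and the source string, the lint rule, and the ScaleTen transformer are made up for the example.

```python
import ast

source = "unused = 1\nx = 2\nresult = x * 10\n"

# Step 1: parse only, stopping before code generation.
tree = ast.parse(source)

# Step 2a: analysis, as a lint tool might do it.
# Collect names that are assigned somewhere but never read.
assigned = {
    n.id for n in ast.walk(tree)
    if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Store)
}
loaded = {
    n.id for n in ast.walk(tree)
    if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Load)
}
never_read = sorted(assigned - loaded)


# Step 2b: manipulation. Rewrite the constant 10 into 100.
class ScaleTen(ast.NodeTransformer):
    def visit_Constant(self, node: ast.Constant) -> ast.AST:
        if node.value == 10:
            return ast.copy_location(ast.Constant(value=100), node)
        return node


tree = ScaleTen().visit(tree)
ast.fix_missing_locations(tree)

# Step 3: hand the (modified) tree to code generation and run the result.
code = compile(tree, "<ast>", "exec")
namespace = {}
exec(code, namespace)
```

After the rewrite, the generated code computes result from the modified tree rather than from the original source text, which is the same property that makes the interception point useful for tooling.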

Once the bytecode has been generated, it is either run directly from memory or saved to a module file for later use. For information on how the bytecode is executed, see Execution.