Monday, 5 January 2009

Which Codegen and IR to use?

I know I haven't really explained much about why I want to build out a JVM with InvokeDynamic, so I thought I'd try and put some of my thoughts down.

Broadly, I'd like to be able to write a compiler for a dynamic language, consisting of a modest subset of Perl 5's syntax. This is largely just to teach myself more about the internals of the JVM, and about language design and implementation in general, but it might come in handy for other people as well - who knows.

I have been getting to grips with the sablecc parser generator toolkit, and already have a grammar for a working subset of the language (although there's quite a long way to go yet).

Sablecc has some features which rather surprised me at first - its Node is an abstract class rather than an interface and the classes which represent terminals in the abstract syntax are all declared final.

This makes it rather hard to use Decorator to annotate the abstract parse tree (as for a Decorator-type approach one would subclass the abstract parse tree terminals).

After a bit of head scratching, I'm now generating a set of Analyzer classes, each of which are obtained by visiting the tree, and storing additional context information in the Analyzer classes themselves.

Each Analyzer instances can now be though of as being a self-consistent set of annotations about the parse tree - all the annotations relevant to a particular type of analysis can be represented as one Analyzer, but there may be multiple Analyzers corresponding to different, disjoint classes of semantic information.

With this now reasonably well understood, I'm turning my attention to the IR and code generation stages. I'm weighing up 2 possible tools and approaches.
  1. ASM. This has a lot of advantages - it is reasonably well documented, once you're over the initial hump, and seems to have good low-level control as well as consistent use of the Visitor pattern to help with conceptual consistency.
  2. The gnu.bytecode and gnu.expr packages from Kawa. These are part of Kawa (a Scheme-orientated package for producing bytecode).
At the moment I'm still trying to decide which I'm going to opt for. Kawa has the advantage that several other languages are already using it, so it may have better performance for repl and similar approaches (which will be necessary if I want to support eval STRING as a case). It also seems to have a relatively consistent set of classes which could be targeted as an IR (which so far, I don't really see ASM having).

However, Kawa doesn't (so far) to be very well documented - I've found myself having to dig through the source code several times already just to find out whether method names need to be specified as package names or internal names, etc. The conceptual framework doesn't seem to be quite as consistent somehow. I'm also slightly scared by the use of bare integers in some places in the gnu.bytecode package - I'd prefer a symbolic constant for each opcode, which I can pull in with a wildcarded static import.

Still, I'll post more as I investigate further.