Build

Build Your Own Symbol Calculator with Ohm

Michael Carroll on Sep 8, 2016
Build Your Own Symbol Calculator with Ohm

Last time I introduced Ohm, an open source meta language parser with an easy to use syntax, we built a parser for different number formats. Basically, I showed you how to easily build your own programming language with just a few lines of code. This week we will extend the parser to calculate arithmetic expressions.

Open up your grammar file and add this code inside CoolNums {

  // just a basic integer
  Expr =  AddExpr
  AddExpr = AddExpr "+" MulExpr -- plus
          | AddExpr "-" MulExpr -- minus
          | MulExpr
  MulExpr = MulExpr "*" PriExpr -- times
          | MulExpr "/" PriExpr -- divide
          | PriExpr
  PriExpr = "(" Expr ")" -- paren
          | Number

The above code is all we need to define full arithmetic with parenthesis and precedence. This is a bit complicated so let's me break it down into pieces. We can read the grammar like this: Expr, as defined as an additive expression. An additive expression can be one of an addition, subtraction, or a multiplication expression. A multiplication expression can be one of times, divide or a primary expression. A primary expression is an expression (Expr) inside parenthesis or a number.

At first this looks very strange. Why is the general expression (Expr) only an additive expression, and is AddExpr just the first line or all three lines? And how would we write this from scratch if we didn't know how to do it beforehand? Let's me break it down into pieces.

Compound Expressions

Ohm supports a compound syntax to let you define multiple rules at once. Adding a -- after a rule makes it a sub-rule with alternation. An alternation means something or something else. This code:

  AddExpr = AddExpr "+" MulExpr -- plus
          | AddExpr "-" MulExpr -- minus
          | MulExpr

is the same as this code:

AddExpr        = AddExpr_plus | AddExpr_minus | MulExpr
AddExpr_plus   = AddExpr "+" MulExpr
AddExpr_minus  = AddExpr "-" MulExpr

In other words, there are three forms of the AddExpr, such as the plus form, the minus form, and a multiplication form. Rather than having to break them out separately Ohm lets us combine them into a more compact and cleaner syntax. We still read it as: ‘an add expression can be one of add plus mul or add minus mul or just mul.'

Operator Precedence

Second, why do we need to group the plus and minus versions of add together, separate from the times and divide forms? This comes down to operator precedence.

Consider the following expression:

4 + 5 * 6

Do you evaluate the + or the * first? The order in which we execute the operators affects the final answer. In many programming languages (including JavaScript) there is a defined operator precedence order. Usually multiplication and division come before addition and subtraction. The expression above is equivalent to:

4 + (5*6)

However, some programming languages, like Smalltalk, evaluate operators left to right without any precedence. So the expression above would be equivalent to:

(4+5) * 6

For this calculator we will go with the JavaScript form of precedence. We must group the multiplication and division together and make sure they are executed before the addition and subtraction. That's why Expr is made up of AddExpr, and AddExpr contains MulExpr, and MulExpr contains the PriExpr. Only in PriExpr do we get to actual numbers. At first this seems backwards. If MulExpr comes first then why is is listed after AddExpr?

We need to consider how things will be evaluated. 4 + 5 * 6 will be parsed into this:

Add(4, Mul(5,6))

The innermost expression is evaluated first, so the MulExpr must be closest to Number. Adds will be evaluated only after all the Muls are done. We define Add in terms of Mul because Add will always contain Mul. Mul will never contain Add (without parenthesis).

I realize this is tricky to understand, and honestly it's one of the reasons I prefer Smalltalk's approach of left to right. In general you only need to implement this once and it's common to just borrow from another grammar that get's it right. I've adapted this one from the official Ohm Math example.

Performing Arithmetic

Now that we have a parser that we know processes things in the right order we can actually do some arithmetic. As we discussed last time, the _grammar_ just parses text into a tree without actually doing any work. The semantics define the real actions which do things. Each action function will perform an operation on the results of its sub nodes. So 4+5 will add the 4 and 5 nodes together. Each of those nodes is an int which returns a real JS number.

Here's what the actions look like for arithmetic:

var Calculator = grammar.createSemantics().addOperation('calc', {
    AddExpr: function(a) {
        return a.calc();
    },
    AddExpr_plus: function(a,_,b) {
        return a.calc() + b.calc();
    },
    MulExpr: function(a) {
        return a.calc();
    },
    MulExpr_times: function(a,_,b) {
        return a.calc() * b.calc();
    },
    //these are the same as before in the previous blog
    int: function(a) {
        return parseInt(this.sourceString,10);
    },
    float: function(a,b,c,d) {
        return parseFloat(this.sourceString);
    },
    hex: function(a,b) {
        return parseInt(this.sourceString.substring(2),16);
    },
    oct: function(a,b) {
        return parseInt(this.sourceString.substring(2),8);
    }
});

The first actions are the only new ones. The others are the same ones from the last blog, they parse all of the forms of numbers (integers, floating point, hexadecimal, and octal).

That's actually it. Just do the basic math for each expression. Of course, if we just built a simple calculator we wouldn't be using the full power of Ohm.

Building an Expression Tree

Our goal is to eventually extend this into a real programming language and what we have now just won't do. It evaluates expressions as they are found in the source code. That's fine for basic math, but what if we want to handle calling a function inside of a loop? The action would only be called once, but we want to invoke the function every time inside of the loop!

Eventually we will also want variables with the ability to run code like x=10 followed by x*2. Now we need a symbol table and a way to look up the current value of the symbol when the math is evaluated. Doing math immediately simply won't work anymore.

The solution is to not perform arithmetic inside of the semantic operation. Instead we must return a tree of objects which represent the arithmetic (and later loops and functions) and can be evaluated anytime we need it. This is called an expression tree, and it's the next big step for building a language interpreter.

Let's define some terms:

  • A number is an actual numeric constant. Something like 4 or 4.5 or 0x45.
  • A symbol is an identifier which points to a number. Something like x or myVal.
  • An assignment is an operation which makes a symbol point to a real number.
  • A binary operation, or BinOp, is a math operation which takes two arguments. Something like 4+5 or 4/5. Right now we only support basic arithmetic, but in the future we will support boolean operators like x

The key concept when building a language is resolution. We can say that the expression 4+5 resolves to 9. Resolution is when we do actual work; the actual calculations. Resolving a binary operation executes the actual operation. Resolving a symbol returns the underlying value that the symbol points to. Resolving a number just returns itself, the number. With these definitions we can start to build some code. First, let's create a class which represents a number. It stores an underlying javascript value, val, and returns itself when resolve is called.

class MNumber {
 constructor(val) { this.val = val; }
 resolve(scope)   { return this; }
 jsEquals(jsval)  { return this.val == jsval; }
}

Note that I added a jsEquals method. This lets us compare the number to a real Javascript number. It is not part of our exposed API, but it is helpful when writing our unit tests later.

Now we can define our basic binary operations for arithmetic. Since addition, subtraction, and the others are basically all the same, create a single BinOp class instead of one for each operation.

class BinOp {
    constructor(op, A, B) {
        this.op = op;
        this.A = A;
        this.B = B;
    }
    resolve(scope) {
        var a = this.A.resolve(scope).val;
        var b = this.B.resolve(scope).val;
        if(this.op == 'add') return new MNumber(a+b);
        if(this.op == 'sub') return new MNumber(a-b);
        if(this.op == 'mul') return new MNumber(a*b);
        if(this.op == 'div') return new MNumber(a/b);
    }
}

BinOp accepts the operation and two values to perform the operation on (called operands in math terms). The resolve method will call resolve on the two operands, pull out the underlying Javascript values, then return a new MNumber by combining them into new values. We could skip calling resolve on the operands because resolve() on a plain number just returns itself. However, I included the resolve call here because later on the operand might not be a number. It might be a symbol or function instead. Defining everything in terms of resolve keeps the code future-proof.

Now we can create our new semantics operation called toAST(). Keep the existing operation, calc, in place. Add to it by creating a second semantics.

var semantics = grammar.createSemantics();
var Calculator = semantics.addOperation('calc', {
   ...
});
var ASTBuilder = semantics.addOperation('toAST', {
    AddExpr_plus:  (a, _, b) => new BinOp('add', a.toAST(), b.toAST()),
    AddExpr_minus: (a, _, b) => new BinOp('sub', a.toAST(), b.toAST()),
    MulExpr_times: (a, _, b) => new BinOp('mul', a.toAST(), b.toAST()),
    MulExpr_divide:(a, _, b) => new BinOp('div', a.toAST(), b.toAST()),
    PriExpr_paren: (_,a,__)  => a.toAST(),
    //reuse the number literal parsing code from `calc` operation
    Number : function(a) { return new MNumber(a.calc()); }
});

As before, most actions recursively call toAST() on the argument. However, the action for Number actually calls calc instead of toAST. The Calculator semantics already define how to parse numbers. We don't need to write that part again. Instead we delegate to the existing calc operation.

This delegation system is a key part of Ohm's design. You can have multiple semantic operations which call each other, as long as they are in the same set of semantics. This is another way to reuse code across parsers. If we one day want to extend our math language further we could do it by creating additional semantic operations instead of modifying the originals.

With the toAST semantics in place we can now rewrite the test code like this:

function test(input, answer) {
    var match = grammar.match(input);
    if(match.failed()) return console.log("input failed to match " + input + match.message);
    var ast = ASTBuilder(match).toAST();
    var result = ast.resolve();
    console.log('result = ', result);
    assert.deepEqual(result.jsEquals(answer),true);
    console.log('success = ', result, answer);
}

Calling toAST() returns an expression object instead of a value. Then we can call resolve on this object to get the final value. This might seem like a lot of work for what is fundamentally the same behavior as our earlier calc operation that did the arithmetic inline. The reason we did all this is to lay the ground for a more advanced feature: symbols.

Adding Symbols

To support variables we need a concept called a symbol. A symbol is just a name, or identifier, which points to a real value. In many cases we could use a real number value instead of a symbol, but symbols give us a special ability: symbols can be redefined. You can write x=2 and x*5 to get 10. Then write x=3 and call x*5 again to get 15. The same code can be invoked multiple times with different results by changing what the symbol points to. This is one of the fundamental concepts of computer science which makes computation possible. Let's start by creating an MSymbol class (I didn't use the name Symbol because that will clash with the future native Symbol class in Javascript).

class MSymbol {
    constructor(name) {
        this.name = name;
    }
    resolve(scope) {
        return scope.getSymbol(this.name);
    }
}

Now we need a place to actually sort what the symbols point to. This is called a scope. For now we will have only one scope called GLOBAL, but in the future we will have more.

class Scope {
    constructor() {
        this.storage = {};
    }
    setSymbol(sym, obj) {
        this.storage[sym.name] = obj;
        return this.storage[sym.name];
    }
    getSymbol(name) {
        if(this.storage[name]) return this.storage[name];
        return null;
    }
}

Now we can create the Assignment operator which actually sets the symbol's value.

class Assignment {
    constructor(sym,val) {
        this.symbol = sym;
        this.val = val;
    }
    resolve(scope) {
        return scope.setSymbol(this.symbol, this.val.resolve(scope));
    }
}

With the various classes in place we just need to update the grammar and our semantics to support it.

Update the grammar like this:

CoolNums {
    // just a basic integer
    Expr =  Assign | AddExpr | Identifier | Number
    
    AddExpr = AddExpr "+" MulExpr -- plus
          | AddExpr "-" MulExpr -- minus
          | MulExpr
    
    MulExpr = MulExpr "*" PriExpr -- times
          | MulExpr "/" PriExpr -- divide
          | PriExpr
    
    PriExpr = "(" Expr ")" -- paren
          | Identifier
          | Number
    
    Assign = Identifier "=" Expr
    Identifier = letter (letter|digit)*
    
    Number = oct | hex | float | int
    int    = digit+
    float  = digit+ "." digit+ exp?
    exp    = "e" "-"? digit+
    hex    = "0x" hexDigit+
    oct    = "0o" octDigit+
    octDigit = "0".."7"
    //hexDigit := "0".."9" | "a".."f" | "A".."F" //already defined by Ohm
}

Note that I've refactored the grammar slightly. Now Expr is one of Assign, AddExpr, Identifier, or Number. Identifier is a variable name that starts with a letter and contains letters or numbers. Assign is an identifier and an expression separated by the = character.

Add these two rules to the toAST semantics operation:

    Assign:        (a, _, b) => new Assignment(a.toAST(), b.toAST()),
    Identifier: function (a, b)  { return new MSymbol(this.sourceString, null) },

Note that we must turn on strict mode to use the new JavaScript class syntax in NodeJS by putting ‘use strict' at the top of the file. This also lets us use the arrow syntax for more compact rule definitions. Also Notice that I didn't use the arrow syntax for the Identifier rule because this needs to reference the this variable of the rule. The arrow syntax uses the this of the enclosing object, which is what we want in most cases but not this particular case.

Now we can add some more unit tests for our new variable syntax with symbols:

test('10',10);
test('x = 10',10);
test('x',10);
test('x * 2',20);
test('x * 0x2',20);

Conclusion

Building a language seems complex at first, but by breaking it down into small steps it becomes quite approachable. We've actually done most of the hard work already. We expanded our number parser into a full calculator, and then into a baby programming language by adding symbols and an AST. Next time we will add conditionals, loops, and function calls to turn this into a real programming language.

The code for this entire series is available at my github repo, but don't cheat by looking ahead! 🙂