Mixfix Operators & Parser Combinators, Part 2

In the previous post I introduced the notion of mixfix operators. In this post we will look at them more closely, in the context of an actual grammar. In the next part we will implement the parser for this grammar, look at performance issues and try to fix them with packrat parsers.

We will implement a simple language that consists of boolean algebra and integer arithmetic expressions. The grammar for the language looks like the following (we’re only considering tokens here and assume that a lexical parser has already identified literals, identifiers and delimiters in the text)

statement   ::= expression | declaration
declaration ::= variable ":=" expression
expression  ::= ??? | value
value       ::= literal | variable
literal     ::= booleanLiteral | integerLiteral
variable    ::= identifier

What should the expression productions look like, though? In examples of parsers and grammars we can commonly find an arithmetic expression language described with concepts of ‘factor’ and ‘term’ to create a precedence relation between addition and multiplication:

expression ::= (term "+")* term
term       ::= (factor "*")* factor
factor     ::= constant | variable
               | "(" expression ")"

This seems simple, but when we add more precedence rules, it can get quite complex, especially if we are writing a parser for a general purpose programming language instead of a simple expression language, and we also do semantic actions (create AST nodes) in the parser. This also makes the set of operators rather fixed: you might have to change several grammar productions to add a new operator with a new precedence level. I didn’t even try building Slang’s precedence rules into the grammar in this fashion.

Mixfix parsers still make the precedence part of the grammar, but there is a layer of abstraction there: we describe operators and their precedence rules as a directed graph, where (groups of) operators are the nodes and precedences are the edges. Then we instantiate the grammar with that particular precedence graph.

Before getting to the precedence rules in the language we are about to create, lets look at the operators it will have. In the list below, _ means a hole in the expression that can contain any other expression that “binds tighter” than the operator in question. In the case where the hole is closed on both left and right, it can contain any expression at all. Only a pair of parentheses forms a closed operator in this language.

( _ ) – parentheses
_ + _ – addition
_ - _ – subtraction
  - _ – negation
_ * _ – multiplication
_ / _ – division
_ ^ _ – exponent
_ mod _ – modulo/remainder
_ = _ – equality test
_ ≠ _ – inequality test
_ < _ – less than
_ > _ – greater than
_ & _ – conjunction
_ | _ – disjunction
  ! _ – logical not

This doesn’t include many common operators in real programming languages, but it is enough to demonstrate some interesting aspects of mixfix operators and using a DAG to describe their precedence relations. I used mod instead of % to show that operators don’t have to be symbols.

Before defining the precedence rules, lets look at some sample expressions and how we want them to be interpreted, mostly sticking with existing well known precedence rules, such as those in C, Java or Scala, but occasionally deviating from them:

a + b * c      = a + (b * c)
a < b & b < c  = (a < b) & (b < c)
-5 ^ 6         = (-5) ^ 6
a & !b | c     = (a & (!b)) | c
5 < 2 ≠ 6 > 3  = (5 < 2) ≠ (6 > 3)
1 < x & !x > 5 = (1 < x) & !(x > 5)

I think that’s enough examples for now. Lets try to describe the rules behind these somewhat intuitive expectations as a precedence graph. First, we’ll put the operators into groups where all operators in one group bind just as tightly as the others in the same group. For example 1 + 2 - 3 will be (1 + 2) - 3 and 1 - 2 + 3 will be (1 - 2) + 3

parentheses   : ()
negation      : - (prefix)
exponent      : ^
multiplication: *, /, mod
addition      : +, -
comparison    : <, >
equality      : =, ≠
not           : ! (prefix)
and           : &
or            : |

Negation (prefix -) is in it’s own group so that we can do: -2 + 1. If it was in the same group with infix - and +, then it couldn’t appear next to them without parentheses because prefix operators are treated as right-associative, but most infix operators, such as - and + are left-associative. And we can’t mix left-associative and right-associative operators of the same precedence level! Why? Take the expression

1 + 2 - 3

If + and - are left-associatve, it means (1 + 2) - 3.

If - is right-associative instead, then both (1 + 2) - 3 and 1 + (2 - 3) would be right!

We could read the list of operator groups above as an order of precedence, where the first group (parentheses) binds tightest and the last group (or) binds least tight. This would be mostly compatible with many programming languages and we would have a good enough set of precedence rules right there.

However, as mentioned earlier, Danielsson’s mixfix grammar scheme describes precedence relations as a directed graph. Each of the groups above is a node in the graph, and a directed edge from one node to another a -> b means: b binds tighter than a. So lets describe these relations as a graph instead — it will be in reverse order compared to the above list where we started from the most tightly binding:

or             -> and, not, equality, comparison, parentheses
and            -> not, equality, comparison, parentheses
not            -> equality, comparison, parentheses

equality       -> comparison, addition, multiplication, exponent, negation, parentheses
comparison     -> addition, multiplication, exponent, negation, parentheses

addition       -> multiplication, exponent, negation, parentheses
multiplication -> exponent, negation, parentheses
exponent       -> negation, parentheses
negation       -> parentheses

Notice that from each group we draw the edge not into a single group, but into all of the groups that bind tighter. This is because of the non-transitivity of precedence in this scheme: each pair of operator groups that is to have a precedence relation must have an edge between them in the graph. The advantage of this is that we don’t need to describe the precedence between operators that aren’t related at all. This is one of the motivations for using a directed graph to represent operator precedence.

I hope that from the names of the operators it was clear that some of them will apply only to booleans and some only to integers. For example, the & operator isn’t defined as bitwise &, only as logical conjunction. Thus, assuming that our language is strongly typed, some of the operators can’t appear in the holes of some other operators in a correct program.

A parser doesn’t do type checking of course, but with this mixfix grammar scheme, it does implicitly do precedence correctness checking. For example 4 + 5 & 6 + 4 is not precedence correct, as we didn’t define a precedence relation between addition and and. And due to the parser’s precedence checking, this expression will not even parse.

If we had used a total precedence order instead, we would have + binding tighter than &. The expression would be interpreted as (4 + 5) & (6 + 4) but would probably yield a type error as & works on booleans, but + works on integers. We could write (4 + 5) & (6 + 4) ourselves and that would also parse, because we made the precedence explicit. Well, actually parentheses follow the same rules: remember that in our graph, () bind tighter than everything.

The fact that the parser only produces precedence correct expressions can be both a blessing and a curse.

On one hand, this allows us to view some unrelated groups of operators almost as sublanguages. In our case, boolean algebra and integer arithmetic. This might be good for implementing internal DSLs in the presence of extremely flexible user-defined mixfix operators. We could allow users to extend our precedence graph or even replace it completely with their own. If a DSL has boolean logic in it, but no arithmetic, it might have precedence relations to logical operators, but not to arithmetic operators. This would preclude arithmetic operators from appearing in the DSL without being surrounded by parentheses. Or the DSL could even disallow parentheses. Implementing this much flexibility in a host language is complicated, though. For example, the parser would have to know about any custom mixfix grammars defined in imported modules.

On the other hand, this puts some correctness checks at the wrong level. Arguably, a parser should only validate the syntax of a program and nothing else. If a simple mistake such as using a wrong operator (equal to calling a non-existing method in some languages) would prevent the whole program from being parsed, it would also prevent the compiler from doing other interesting and useful things, or reporting better error messages.

So maybe this grammar scheme isn’t ideal for a general purpose programming language. I am sticking with it in Slang for now, because the scheme is relatively simple and works for me at least as long as I’m the only user of Slang :) And perhaps there are workarounds that would allow a precedence-incorrect expression to be accepted by the parser still. But I don’t have immediate plans to allow a wide variety of user-defined mixfix operators or operator precedence.

Anyway, for our simple language, I think this scheme works well enough as long as we don’t care whether it is the parser or the type checker that reports the errors in incorrect programs. There aren’t any useful direct precedence relations between boolean algebra operators and arithmetic operators here. Only by having equality, comparison or parentheses between them, can we put them in the same expression.

Lets look at one of the consequences of our rules more closely. Many languages, including Java and C, would put most prefix (unary) operators such as ! and - at the same level of precedence, binding tighter than all infix (binary) operators. In Java, !6 == 5 is a type error because the operator ! is bound to 6, not to 6 == 5, and ! isn’t defined on integers. In our language, it isn’t necessary to have ! at the same level as -, though. Since there is no (precedence) relation between logical and arithmetic operators, !6 + 5 will not parse. But ! does have a relation to comparison and equality tests (they bind tighter), so you can write !6 = 5 and it will mean !(6 = 5).

The precedence rules that have = binding tighter than boolean operators is based on the assumption that booleans are rarely compared to each other, but multiple comparisons of other types of values are often used in disjunctions, conjunctions and complements.

To get back to the question in the beginning of the post, what would the expression production in the grammar look like instead of expression ::= ??? | value? The short answer is that we replace ??? with the mixfix grammar scheme instantiated with our particular precedence graph. The long answer would probably take an entire blog post by itself. You can read more about this scheme in the Agda paper, or look at the source code of my mixfix library. The scheme looks somewhat like the parser combinators in the following pseudo-code (~ means sequential composition):

value = variable | literal
expression = mixfixGrammar(precedenceGraph) | value

mixfixGrammar(graph) = {
  // graph - the precedence graph
  // g - an operator group, node in the graph
  // op - an operator in a group

  ⋁(parsers) = // returns the result of the first parser in the list to succeed

  opsLeft(g)   = // all left-associative infix operators in g
  opsRight(g)  = // all right-associative infix operators in g
  opsNon(g)    = // all non-associative infix operators in g
  opsClosed(g) = // all closed operators in g
  opsPre(g)    = // all prefix operators in g
  opsPost(g)   = // all postfix operators in g

  operator(op) =
    if (op.internalArity == 0)
      op.namePart1
    if (op.internalArity == 1)
      op.namePart1 ~ expression ~ op.namePart2
      // expression is an recursive reference back to the "outer" production
      // these are the internal "holes" that can take any expression

  group(g)  = closed(g) | non(g) | left(g) | right(g)    // any ops in this group

  closed(g) = ⋁{ opsClosed(g) map operator }             // closed ops

  non(g)    = ↑(g) ~ ⋁{ opsNon(g) map operator } ~ ↑(g)  // non-associative ops

  left(g)   = (left(g) | ↑(g))                           // left-associative ops
              ~ ( ⋁{ opsPost(g) map operator }
                | ⋁{ opsLeft(g) map operator } ~ ↑(g) )

  right(g)  = ( ⋁{ opsPre(g) map operator }              // right-associative ops
              | ↑(g) ~ ⋁{ opsRight(g) map operator } )
              ~ (right(g) | ↑(g))

  ↑(g) = ⋁{ graph.groupsTighterThan(g) map group } // every group that binds tighter than g
         | value                                   // or the tightest "group" of values

  return ⋁{ graph.nodes map group }
}

If you don’t understand this right now, no big deal — it’s late enough that I couldn’t come up with a better representation of the actual code that would fit in this post. And if you are not familiar with parser combinators I would recommend reading Daniel Spiewak’s post on the subject, at least before continuing to the next part of this series.

If you notice, the value and expression productions are referenced inside the mixfixGrammar. This is no good if the mixfix library is to be a separate module, so I actually implemented that by introducing a pseudo operator group that has a custom parser. This pseudo-group is then added to the precedence graph along with edges from every other group into that “really tight” group.

This concludes part 2. In the next part we will forget this pseudo-code and use Scala’s parser combinators and my mixfix library to implement an actual parser for the language, and maybe an AST and an interpreter as well.

Thanks to Miles Sabin and Daniel Spiewak for reviewing drafts for this series of posts.

Advertisements

One thought on “Mixfix Operators & Parser Combinators, Part 2

  1. Pingback: Mixfix Operators & Parser Combinators, Bonus Part 2a « Villane

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s