Generating Sublime Text syntax definitions

This past year I started playing with the programming language Faust, and I wanted better syntax highlighting for it in Sublime Text, the text editor I use for programming. I didn’t want to write a sublime-syntax definition file by hand, since that’s pretty complicated, so instead I wrote a program to generate one automatically from a language’s formal grammar. That turned out to be complicated, too! But it was more to my taste.

Note: you can view the source code and documentation on Github, and install the program via pip install sublime-from-cfg (requires Python 3.9). For example, something like:

$ python3.9 -m pip install sublime-from-cfg
$ sublime-from-cfg my-language.sbnf -o my-language.sublime-syntax

The Faust syntax I generated is here.

Sublime Text’s .sublime-syntax files

Syntax highlighting is already included in Sublime Text for many popular languages, such as (for example) C, where code will automatically get colored something like this:

if (true) {
    printf("Hello World!");
}

Notice that the if, printf, and "Hello World!" are colored differently. To create this syntax highlighting, a minimal .sublime-syntax file would look something like this:

%YAML 1.2
---
name: C
file_extensions: [c, h]
scope: source.c

contexts:
  main:
    - match: \b(if|else|for|while)\b
      scope: keyword.control.c
    - match: '"'
      push: string
    - match: (?=\b[[:alpha:]_][[:alnum:]_]*\b\()
      push: function-call

  string:
    - meta_scope: string.quoted.double.c
    - match: '"'
      pop: 1

  function-call:
    - match: \b[[:alpha:]_][[:alnum:]_]*\b
      scope: variable.function.c
    - match: \(
      pop: 1

The syntax highlighting engine has a context stack which always starts in the main context at the start of the file you’re editing. Whenever one of the match objects in a context matches (as a regular expression) the current text, a context can be pushed onto the stack, or some number of contexts can be popped, and the syntax engine advances to the next part of the edited file. For example, when a double quote appears, the string context is pushed onto the stack. Then the syntax engine advances until the next double quote appears, and then the string context is popped off the stack and the syntax engine returns to the main context.

These are just the most basic features available to authors of .sublime-syntax files. You can also do things like embed one language inside another (for example, to highlight an HTML <script> tag as JavaScript), do “nondeterministic” parsing by aborting when something unexpected happens and trying a different parsing strategy, and more.

Programming languages can be quite complex, and therefore the .sublime-syntax files get quite complex too. The official Sublime Text C language syntax definition file is (at the time of writing) 1320 lines long. JavaScript’s is 2533 lines long. Java’s is 3234 lines long.

Faust already has a language definition

The source code for the Faust programming language is freely available on Github. The parser for Faust is generated via a yacc source file. Yacc is a parser generator: a program that consumes a formal grammar for a programming language, and generates a parser (an ingredient of a compiler) for that language. It’s a compiler for compilers. The part that’s relevant to me is: that means that there’s already a concise high-level description of the Faust programming language. An excerpt from the link above:

program         : stmtlist
                ;

stmtlist        : /*empty*/
                | stmtlist variantlist statement
                ;

statement       : IMPORT LPAR uqstring RPAR ENDDEF
                | DECLARE name string  ENDDEF
                | DECLARE name name string  ENDDEF
                | definition
                | BDOC doc EDOC
                ;
...

This defines a valid Faust program. A word in lowercase is a nonterminal. A word in uppercase is a terminal. Nonterminals can be expanded by choosing one of the options on the right hand side; terminals (“terminal” in the sense of “final”) cannot be expanded any more.

To interpret the above: a Faust program starts at program. That gets expanded to stmtlist (the only option). stmtlist can either be expanded to the first option (an empty string), or to the second option (stmtlist variantlist statment). The first term there is stmtlist again. This is a recursive definition of stmtlist which means that stmtlist can refer to variantlist statement repeated any number of times. Skipping over variantlist (which I’ve omitted because it’s not very illuminating), statement can apparently be expanded in a few different ways – via some guesswork about what the various names mean, a statement can be either an import, a declaration (in two different ways), a definition, or a “doc” (in this case it’s documentation).

If you read the linked file above in more detail, you’ll see all the other nonterminals (name, string, definition, etc.) appear on the left hand side with their own expansion options.

Context-free grammars and parser generators

Language descriptions of this form, with nonterminals being expanded as strings of nonterminals and terminals, are known as context-free grammars. Many (most, even?) programming languages can be described with a context-free grammar. Context-free grammars are also studied in formal language theory, in theoretical linguistics and computer science.

Context-free grammars are also very convenient for people creating their own programming languages because parser generators like yacc can be used to generate parsers for them – the parser generator consumes the context-free grammar and produces a parser for the language.

That means that many programming languages will already have a context-free grammar lying around which describes them, either as part of the language’s source or as part of its specification.

Converting a CFG to a sublime syntax definition

So how do I convert a context-free grammar to a .sublime-syntax file? I’ll illustrate the process with some progressively more complicated examples. (Though I will leave out some complicating details.)

1. Only one production per nonterminal

First, say I have a simple context-free grammar where each nonterminal only has one production:

main : a b
     ;
a    : 'c' 'd'
     ;
b    : 'e' 'f'
     ;

Here main, a, and b are nonterminals, and 'c', 'd', 'e', and 'f' are terminals. This is a very boring language whose only valid string is cdef. That’s because main expands to the string of nonterminals a followed by b, each of a and b expands to a pair of terminals, and then expansion stops. There’s no alternate production available to choose at any stage.

To render this as a .sublime-syntax definition, I make a context for each nonterminal and each terminal:

main:
  - match: ''
    set: [b, a]
a:
  - match: ''
    set: [terminal-d, terminal-c]
b:
  - match: ''
    set: [terminal-f, terminal-e]
terminal-c:
  - match: c
    pop: 1
  - include: fail
terminal-d:
  - match: d
    pop: 1
  - include: fail
terminal-e:
  - match: e
    pop: 1
  - include: fail
terminal-f:
  - match: f
    pop: 1
  - include: fail
fail:
  - match: \S
    scope: invalid.illegal

The match: '' instruction means “always match”; set means “pop the current context and then push the following contexts”. So when the main context starts, it immediately pops and is replaced by the contexts b and a. That’s in the reverse order of the production main : a b, so that a will be at the top of the stack. Then likewise the a context is immediately popped and replaced by terminal-d and terminal-c corresponding to the production a: 'c' 'd'. The terminal-c context will match on the literal character c and then pop (making terminal-d the top of the stack), and any other non-whitespace character will get marked as invalid (that’s what the included fail context does – \S is the regular expression for “any non-whitespace character”). After c is matched, and then d for the terminal-d context, b is at the top of the stack and is replaced by [terminal-f, terminal-e], and so e and f get matched in turn.

2. Multiple productions with different initial sets

A slightly more complicated example, where a nonterminal can expand to different productions:

  main : a b
       ;
  a    : 'c' 'd'
+      | 'x'
       ;
  b    : 'e' 'f'
       ;

Now the nonterminal a can match either 'c' followed by 'd' or the terminal 'x'. This is a more interesting language, which can match two strings: cdef and xef. (Okay, it’s only a little more interesting.)

The main context still looks the same, and likewise b, because both of those nonterminals only have a single production, but I need to change the a context. I can do that by using a lookahead regular expression, which will match but not consume text, so that it can still be matched later:

a:
  - match: (?=c)
    set: [terminal-d, terminal-c]
  - match: (?=x)
    set: terminal-x
  - include: fail

Instead of unconditionally replacing the current context with [terminal-d, terminal-c], I set the contexts corresponding to the next character. If the next character is c then the (?=c) regex matches and the context is replaced by [terminal-d, terminal-c]. If instead the next character is x then the (?=x) regex matches and the context is replaced by terminal-x. If the next character is neither c nor x then this is not a valid string in my simple language: the fail context will match and mark the character invalid.

In this case, the productions for a both start with terminals, so it’s easy to see what the lookahead regex should be, but the productions could also start with nonterminals. If, for example, the language had the rule main : a b | 'z' ;, then I’d need to know whether to pick a b or 'z'. In that case, I’d recursively compute (for each nonterminal) the set of possible terminals it could start with. The nonterminal a can start with a 'c' or 'x', so a lookahead regex matching either of those characters should lead to picking the first production a b. A 'z' should lead to picking the second production 'z'. (I’m skipping over some details that apply when a production might match the empty string – see the wikipedia page on LL parsers for lots more.)

3. Multiple productions with overlapping initial set

The most interesting and complicated situation is where multiple productions might start with the same terminal, and so I can’t use a lookahead regular expression to decide which to choose.

For example:

  main : a b
       ;
  a    : 'c' 'd'
+      | 'c' 'z'
       | 'x'
       ;
  b    : 'e' 'f'
       ;

This is still a very simple language, with only three valid strings: cdef, czef, and xef. Now the context for a needs to change to accommodate the new production 'c' 'z', but unfortunately two productions now start with 'c' and the lookahead regular expression (?=c) won’t distinguish between them.

Before I explain how I handle this, though, I actually want to modify my approach slightly in the previous situation, and change how I handle unexpected characters:

  main:
+   - match: ''
+     push: [invalid, main/]
+ main/:
    - match: ''
-     set: [b, a]
+     set: [b, pop2, a]
  a:
    - match: (?=c)
-     set: [terminal-d, terminal-c]
+     set: [terminal-d, pop2, terminal-c]
    - match: (?=x)
      set: terminal-x
    - include: fail
  b:
    - match: ''
-     set: [terminal-f, terminal-e]
+     set: [terminal-f, pop2, terminal-e]
  terminal-c:
    - match: c
-     pop: 1
+     pop: 2
    - include: fail
  terminal-d:
    - match: d
-     pop: 1
+     pop: 2
    - include: fail
  terminal-e:
    - match: e
-     pop: 1
+     pop: 2
    - include: fail
  terminal-f:
    - match: f
-     pop: 1
+     pop: 2
    - include: fail
+ invalid:
+   - match: \S
+   - scope: invalid.illegal
+ pop2:
+   - match: ''
+     pop: 2
  fail:
-   - match: \S
-     scope: invalid.illegal
+   - match: (?=\S)
+     pop: 1

Reminder: this doesn’t add the new 'c' 'z' production yet. The changes are that (1) the symbols in each production are interleaved by the special context pop2 which unconditionally pops two contexts off the stack, (2) that each terminal context pops two contexts off the stack when it successfully matches, and (3) when a context fails then it pops one context off the stack without consuming any text. The effect is that when parsing a valid string, pushes and pops always happen in pairs, so everything works just as before: contexts that actually do something are separated by a buffer context (the pop2 context), and a success means popping twice. If an unexpected symbol appears, however, the pop: 1 instruction means that the next context is a pop2 context, which pops twice to another pop2 context, and so on, until the entire stack has been popped, leaving only the new invalid context which marks everything from then on invalid.

The whole point of this is accommodate nondeterministic parsing.

Sublime Text’s syntax engine allows nondeterministic parsing. Instead of pushing or setting a context, you can branch: when you branch, you provide a list of contexts to try, one after another. If, during one of those contexts, there’s a fail action, then the syntax engine rewinds back to where it was when it branched, and tries again with the next context. The full contexts corresponding to this example are below, which changes indicated from the previous version:

  main:
    - match: ''
      push: [invalid, main/]
  main/:
    - match: ''
      set: [b, pop2, a]
  a:
    - match: (?=c)
-     set: [terminal-d, pop2, terminal-c]
+     set: a@01
    - match: (?=x)
      set: terminal-x
    - include: fail
  b:
    - match: ''
      set: [terminal-f, pop2, terminal-e]
+ a@01:
+   - match: ''
+     branch_point: a@01
+     branch: [a@01@0, a@01@1]
+ a@01@0:
+   - match: ''
+     set: [pop3, a@01@fail, terminal-d, pop2, terminal-c]
+ a@01@1:
+   - match: ''
+     set: [pop3, pop3, terminal-z, pop2, terminal-c]
+ a@01@fail:
+   - match: ''
+     fail: a@01
  terminal-c:
    - match: c
      pop: 2
    - include: fail
  # other terminals omitted for brevity: d, e, f
+ terminal-z:
+   - match: z
+     pop: 2
+   - include: fail
  invalid:
    - match: \S
      scope: invalid.illegal
  pop2:
    - match: ''
      pop: 2
+ pop3:
+   - match: ''
+     pop: 3
  fail:
    - match: (?=\S)
      pop: 1

Whenever a lookahead might match more than one production, then I create a special branch context (here, a@01) to try each production in sequence. The context for each production will push a pair of contexts to handle success and failure respectively, and then the contexts corresponding to the symbols for that production. The success context is pop3, which just pops out of the branch context (and skips over the pop2 buffer), and the failure context triggers the special “try the next branch” action. The last try for each branch doesn’t have that special trigger (because there’s no other production to try), so it just pops out to the pop2 so that the next failure up the stack can be triggered.

Further details

And that’s essentially it! I’ve left out a few details for the sake of keeping the story relatively simple, but there are a few more tricks I use in ambiguous situations. There are some grammars that I can’t handle – for example, left-recursive grammars – and you can read a few more details about my approach on the Github page for the project. The general parsing approach is due to Adrian Johnstone and Elizabeth Scott, though adapting that approach to generating a .sublime-syntax file is my own work. The file format I use to represent context-free grammars is from Benjamin Schaaf’s sbnf, which is a very similar project to mine (though implemented entirely differently).