Generating Sublime Text syntax definitions
This past year I started playing with the programming language Faust, and I wanted better syntax highlighting for it in Sublime Text, the text editor I use for programming. I didn’t want to write a sublime-syntax definition file by hand, since that’s pretty complicated, so instead I wrote a program to generate one automatically from a language’s formal grammar. That turned out to be complicated, too! But it was more to my taste.
Note: you can view the source code and documentation on Github, and install the program via pip install sublime-from-cfg
(requires Python 3.9). For example, something like:
$ python3.9 -m pip install sublime-from-cfg
$ sublime-from-cfg my-language.sbnf -o my-language.sublime-syntax
The Faust syntax I generated is here.
Sublime Text’s .sublime-syntax
files
Syntax highlighting is already included in Sublime Text for many popular languages, such as (for example) C, where code will automatically get colored something like this:
if (true) {
printf("Hello World!");
}
Notice that the if
, printf
, and "Hello World!"
are colored differently. To create this syntax highlighting, a minimal .sublime-syntax
file would look something like this:
%YAML 1.2
---
name: C
file_extensions: [c, h]
scope: source.c
contexts:
main:
- match: \b(if|else|for|while)\b
scope: keyword.control.c
- match: '"'
push: string
- match: (?=\b[[:alpha:]_][[:alnum:]_]*\b\()
push: function-call
string:
- meta_scope: string.quoted.double.c
- match: '"'
pop: 1
function-call:
- match: \b[[:alpha:]_][[:alnum:]_]*\b
scope: variable.function.c
- match: \(
pop: 1
The syntax highlighting engine has a context stack which always starts in the main
context at the start of the file you’re editing. Whenever one of the match objects in a context matches (as a regular expression) the current text, a context can be pushed onto the stack, or some number of contexts can be popped, and the syntax engine advances to the next part of the edited file. For example, when a double quote appears, the string
context is pushed onto the stack. Then the syntax engine advances until the next double quote appears, and then the string
context is popped off the stack and the syntax engine returns to the main
context.
These are just the most basic features available to authors of .sublime-syntax
files. You can also do things like embed one language inside another (for example, to highlight an HTML <script>
tag as JavaScript), do “nondeterministic” parsing by aborting when something unexpected happens and trying a different parsing strategy, and more.
Programming languages can be quite complex, and therefore the .sublime-syntax
files get quite complex too. The official Sublime Text C language syntax definition file is (at the time of writing) 1320 lines long. JavaScript’s is 2533 lines long. Java’s is 3234 lines long.
Faust already has a language definition
The source code for the Faust programming language is freely available on Github. The parser for Faust is generated via a yacc source file. Yacc is a parser generator: a program that consumes a formal grammar for a programming language, and generates a parser (an ingredient of a compiler) for that language. It’s a compiler for compilers. The part that’s relevant to me is: that means that there’s already a concise high-level description of the Faust programming language. An excerpt from the link above:
program : stmtlist
;
stmtlist : /*empty*/
| stmtlist variantlist statement
;
statement : IMPORT LPAR uqstring RPAR ENDDEF
| DECLARE name string ENDDEF
| DECLARE name name string ENDDEF
| definition
| BDOC doc EDOC
;
...
This defines a valid Faust program. A word in lowercase is a nonterminal. A word in uppercase is a terminal. Nonterminals can be expanded by choosing one of the options on the right hand side; terminals (“terminal” in the sense of “final”) cannot be expanded any more.
To interpret the above: a Faust program starts at program
. That gets expanded to stmtlist
(the only option). stmtlist
can either be expanded to the first option (an empty string), or to the second option (stmtlist variantlist statment
). The first term there is stmtlist
again. This is a recursive definition of stmtlist
which means that stmtlist
can refer to variantlist statement
repeated any number of times. Skipping over variantlist
(which I’ve omitted because it’s not very illuminating), statement
can apparently be expanded in a few different ways – via some guesswork about what the various names mean, a statement can be either an import, a declaration (in two different ways), a definition, or a “doc” (in this case it’s documentation).
If you read the linked file above in more detail, you’ll see all the other nonterminals (name
, string
, definition
, etc.) appear on the left hand side with their own expansion options.
Context-free grammars and parser generators
Language descriptions of this form, with nonterminals being expanded as strings of nonterminals and terminals, are known as context-free grammars. Many (most, even?) programming languages can be described with a context-free grammar. Context-free grammars are also studied in formal language theory, in theoretical linguistics and computer science.
Context-free grammars are also very convenient for people creating their own programming languages because parser generators like yacc can be used to generate parsers for them – the parser generator consumes the context-free grammar and produces a parser for the language.
That means that many programming languages will already have a context-free grammar lying around which describes them, either as part of the language’s source or as part of its specification.
Converting a CFG to a sublime syntax definition
So how do I convert a context-free grammar to a .sublime-syntax
file? I’ll illustrate the process with some progressively more complicated examples. (Though I will leave out some complicating details.)
1. Only one production per nonterminal
First, say I have a simple context-free grammar where each nonterminal only has one production:
main : a b
;
a : 'c' 'd'
;
b : 'e' 'f'
;
Here main
, a
, and b
are nonterminals, and 'c'
, 'd'
, 'e'
, and 'f'
are terminals. This is a very boring language whose only valid string is cdef
. That’s because main
expands to the string of nonterminals a
followed by b
, each of a
and b
expands to a pair of terminals, and then expansion stops. There’s no alternate production available to choose at any stage.
To render this as a .sublime-syntax
definition, I make a context for each nonterminal and each terminal:
main:
- match: ''
set: [b, a]
a:
- match: ''
set: [terminal-d, terminal-c]
b:
- match: ''
set: [terminal-f, terminal-e]
terminal-c:
- match: c
pop: 1
- include: fail
terminal-d:
- match: d
pop: 1
- include: fail
terminal-e:
- match: e
pop: 1
- include: fail
terminal-f:
- match: f
pop: 1
- include: fail
fail:
- match: \S
scope: invalid.illegal
The match: ''
instruction means “always match”; set
means “pop the current context and then push the following contexts”. So when the main
context starts, it immediately pops and is replaced by the contexts b
and a
. That’s in the reverse order of the production main : a b
, so that a
will be at the top of the stack. Then likewise the a
context is immediately popped and replaced by terminal-d
and terminal-c
corresponding to the production a: 'c' 'd'
. The terminal-c
context will match on the literal character c
and then pop (making terminal-d
the top of the stack), and any other non-whitespace character will get marked as invalid (that’s what the included fail
context does – \S
is the regular expression for “any non-whitespace character”). After c
is matched, and then d
for the terminal-d
context, b
is at the top of the stack and is replaced by [terminal-f, terminal-e]
, and so e
and f
get matched in turn.
2. Multiple productions with different initial sets
A slightly more complicated example, where a nonterminal can expand to different productions:
main : a b
;
a : 'c' 'd'
+ | 'x'
;
b : 'e' 'f'
;
Now the nonterminal a
can match either 'c'
followed by 'd'
or the terminal 'x'
. This is a more interesting language, which can match two strings: cdef
and xef
. (Okay, it’s only a little more interesting.)
The main
context still looks the same, and likewise b
, because both of those nonterminals only have a single production, but I need to change the a
context. I can do that by using a lookahead regular expression, which will match but not consume text, so that it can still be matched later:
a:
- match: (?=c)
set: [terminal-d, terminal-c]
- match: (?=x)
set: terminal-x
- include: fail
Instead of unconditionally replacing the current context with [terminal-d, terminal-c]
, I set the contexts corresponding to the next character. If the next character is c
then the (?=c)
regex matches and the context is replaced by [terminal-d, terminal-c]
. If instead the next character is x
then the (?=x)
regex matches and the context is replaced by terminal-x
. If the next character is neither c
nor x
then this is not a valid string in my simple language: the fail
context will match and mark the character invalid.
In this case, the productions for a
both start with terminals, so it’s easy to see what the lookahead regex should be, but the productions could also start with nonterminals. If, for example, the language had the rule main : a b | 'z' ;
, then I’d need to know whether to pick a b
or 'z'
. In that case, I’d recursively compute (for each nonterminal) the set of possible terminals it could start with. The nonterminal a
can start with a 'c'
or 'x'
, so a lookahead regex matching either of those characters should lead to picking the first production a b
. A 'z'
should lead to picking the second production 'z'
. (I’m skipping over some details that apply when a production might match the empty string – see the wikipedia page on LL parsers for lots more.)
3. Multiple productions with overlapping initial set
The most interesting and complicated situation is where multiple productions might start with the same terminal, and so I can’t use a lookahead regular expression to decide which to choose.
For example:
main : a b
;
a : 'c' 'd'
+ | 'c' 'z'
| 'x'
;
b : 'e' 'f'
;
This is still a very simple language, with only three valid strings: cdef
, czef
, and xef
. Now the context for a
needs to change to accommodate the new production 'c' 'z'
, but unfortunately two productions now start with 'c'
and the lookahead regular expression (?=c)
won’t distinguish between them.
Before I explain how I handle this, though, I actually want to modify my approach slightly in the previous situation, and change how I handle unexpected characters:
main:
+ - match: ''
+ push: [invalid, main/]
+ main/:
- match: ''
- set: [b, a]
+ set: [b, pop2, a]
a:
- match: (?=c)
- set: [terminal-d, terminal-c]
+ set: [terminal-d, pop2, terminal-c]
- match: (?=x)
set: terminal-x
- include: fail
b:
- match: ''
- set: [terminal-f, terminal-e]
+ set: [terminal-f, pop2, terminal-e]
terminal-c:
- match: c
- pop: 1
+ pop: 2
- include: fail
terminal-d:
- match: d
- pop: 1
+ pop: 2
- include: fail
terminal-e:
- match: e
- pop: 1
+ pop: 2
- include: fail
terminal-f:
- match: f
- pop: 1
+ pop: 2
- include: fail
+ invalid:
+ - match: \S
+ - scope: invalid.illegal
+ pop2:
+ - match: ''
+ pop: 2
fail:
- - match: \S
- scope: invalid.illegal
+ - match: (?=\S)
+ pop: 1
Reminder: this doesn’t add the new 'c' 'z'
production yet. The changes are that (1) the symbols in each production are interleaved by the special context pop2
which unconditionally pops two contexts off the stack, (2) that each terminal context pops two contexts off the stack when it successfully matches, and (3) when a context fails then it pops one context off the stack without consuming any text. The effect is that when parsing a valid string, pushes and pops always happen in pairs, so everything works just as before: contexts that actually do something are separated by a buffer context (the pop2
context), and a success means popping twice. If an unexpected symbol appears, however, the pop: 1
instruction means that the next context is a pop2
context, which pops twice to another pop2
context, and so on, until the entire stack has been popped, leaving only the new invalid
context which marks everything from then on invalid.
The whole point of this is accommodate nondeterministic parsing.
Sublime Text’s syntax engine allows nondeterministic parsing. Instead of pushing or setting a context, you can branch: when you branch, you provide a list of contexts to try, one after another. If, during one of those contexts, there’s a fail action, then the syntax engine rewinds back to where it was when it branched, and tries again with the next context. The full contexts corresponding to this example are below, which changes indicated from the previous version:
main:
- match: ''
push: [invalid, main/]
main/:
- match: ''
set: [b, pop2, a]
a:
- match: (?=c)
- set: [terminal-d, pop2, terminal-c]
+ set: a@01
- match: (?=x)
set: terminal-x
- include: fail
b:
- match: ''
set: [terminal-f, pop2, terminal-e]
+ a@01:
+ - match: ''
+ branch_point: a@01
+ branch: [a@01@0, a@01@1]
+ a@01@0:
+ - match: ''
+ set: [pop3, a@01@fail, terminal-d, pop2, terminal-c]
+ a@01@1:
+ - match: ''
+ set: [pop3, pop3, terminal-z, pop2, terminal-c]
+ a@01@fail:
+ - match: ''
+ fail: a@01
terminal-c:
- match: c
pop: 2
- include: fail
# other terminals omitted for brevity: d, e, f
+ terminal-z:
+ - match: z
+ pop: 2
+ - include: fail
invalid:
- match: \S
scope: invalid.illegal
pop2:
- match: ''
pop: 2
+ pop3:
+ - match: ''
+ pop: 3
fail:
- match: (?=\S)
pop: 1
Whenever a lookahead might match more than one production, then I create a special branch context (here, a@01
) to try each production in sequence. The context for each production will push a pair of contexts to handle success and failure respectively, and then the contexts corresponding to the symbols for that production. The success context is pop3
, which just pops out of the branch context (and skips over the pop2
buffer), and the failure context triggers the special “try the next branch” action. The last try for each branch doesn’t have that special trigger (because there’s no other production to try), so it just pops out to the pop2
so that the next failure up the stack can be triggered.
Further details
And that’s essentially it! I’ve left out a few details for the sake of keeping the story relatively simple, but there are a few more tricks I use in ambiguous situations. There are some grammars that I can’t handle – for example, left-recursive grammars – and you can read a few more details about my approach on the Github page for the project. The general parsing approach is due to Adrian Johnstone and Elizabeth Scott, though adapting that approach to generating a .sublime-syntax
file is my own work. The file format I use to represent context-free grammars is from Benjamin Schaaf’s sbnf, which is a very similar project to mine (though implemented entirely differently).