Grammar language

This section describe the grammar language, its syntax and semantics rules.

The Rustemo grammar specification language is based on BNF with syntactic sugar extensions which are optional and build on top of a pure BNF. Rustemo grammars are based on Context-Free Grammars (CFGs) and are written declaratively. This means you don't have to think about the parsing process like in e.g. PEGs. Ambiguities are dealt with explicitly (see the section on conflicts).

The structure of the grammar

Each grammar file consists of two parts:

derivation/production rules,
terminal definitions which are written after the keyword terminals.

Each derivation/production rule is of the form:

<symbol>: <expression> ;

where <symbol> is a grammar non-terminal and <expression> is one or more sequences of grammar symbol references separated by choice operator |.

For example:

Fields: Field | Fields "," Field;

Here Fields is a non-terminal grammar symbol and it is defined as either a single Field or, recursively, as Fields followed by a string terminal , and than by another Field. It is not given here but Field could also be defined as a non-terminal. For example:

Field: QuotedField | FieldContent;

Or it could be defined as a terminal in terminals section:

terminals
Field: /[A-Z]*/;

This terminal definition uses regular expression recognizer.

Terminals

Terminal symbols of the grammar define the fundamental or atomic elements of your language, tokens or lexemes (e.g. keywords, numbers).

Terminals are specified at the end of the grammar file, after production rules, following the keyword terminals.

Tokens are recognized from the input by a lexer component. Rustemo provides a string lexer out-of-the-box which enable lexing based on recognizers provided in the grammar. If more control is needed, or if non-textual context has been parsed a custom lexer must be provided. See the lexers section for more.

Each terminal definition is in the form:

<terminal name>: <recognizer>;

where <recognizer> can be omitted if custom lexer is used.

The default string lexer enables specification of two kinds of terminal recognizers:

String recognizer
Regex recognizer

String recognizer

String recognizer is defined as a plain string inside single or double quotes. For example, in a grammar rule:

MyRule: "start" OtherRule "end";

"start" and "end" will be terminals with string recognizers that match exactly the words start and end. In this example we have recognizers inlined in the grammar rule.

For each string recognizer you must provide its definition in the terminals section in order to define a terminal name.

terminals
Start: "start";
End: "end";

You can reference the terminal from the grammar rule, like:

MyRule: Start OtherRule End;

or use the same string recognizer inlined in the grammar rules, like we have seen before. It is your choice. Sometimes it is more readable to use string recognizers directly. But, anyway you must always declare the terminal in the terminals section for the sake of providing names which are used in the code of the generated parser.

Regular expression recognizer

Or regex recognizer for short is a regex pattern written inside slashes (/.../).

For example:

terminals
Number: /\d+/;

This rule defines terminal symbol Number which has a regex recognizer that will recognize one or more digits from the input.

Note

You cannot write regex recognizers inline like you can do with string recognizers. This constraint is introduced because regexes are not that easy to write and they don't add to readability so it is always better to reference regex terminal by name in grammar rules.

Warning

During regex construction a ^ prefix is added to the regex from the grammar to make sure that the content is matched at the current input position. This can be an issue if you use a pattern like A|B in your regex as it translates to ^A|B which matches either A at the current position or B in the rest of the input. So, the workaround for now is to use (A|B), i.e. always wrap alternative choices in parentheses.

Usual patterns

This section explains how some common grammar patterns can be written using just a plain Rustemo BNF-like notation. Afterwards we'll see some syntax sugar extensions which can be used to write these patterns in a more compact and readable form.

One or more

This pattern is used to match one or more things.

For example, Sections rule below will match one or more Section.

Sections: Section | Sections Section;

Notice the recursive definition of the rule. You can read this as

Sections is either a single Section or Sections followed by a Section.

Note

Please note that you could do the same with this rule:

Sections: Section | Section Sections;

which will give you similar result but the resulting tree will be different. Notice the recursive reference is now at the end of the second production.

Previous example will reduce sections early and then add another section to it, thus the tree will be expanding to the left. The example in this note will collect all the sections and than start reducing from the end, thus building a tree expanding to the right. These are subtle differences that are important when you start writing your semantic actions. Most of the time you don't care so use the first version as it is more efficient in the context of the LR parsing.

Zero or more

This pattern is used to match zero or more things.

For example, Sections rule below will match zero or more Section.

Sections: Section | Sections Section | EMPTY;

Notice the addition of the EMPTY choice at the end. This means that matching nothing is a valid Sections non-terminal. Basically, this rule is the same as one-or-more except that matching nothing is also a valid solution.

Same note from the above applies here to.

Optional

When we want to match something optionally we can use this pattern:

OptHeader: Header | EMPTY;

In this example OptHeader is either a Header or nothing.

Syntactic sugar - BNF extensions

Previous section gives the overview of the basic BNF syntax. If you got to use various BNF extensions (like Kleene star) you might find writing patterns in the previous section awkward. Since some of the patterns are used frequently in the grammars (zero-or-more, one-or-more etc.) Rustemo provides syntactic sugar for this common idioms using a well known regular expression syntax.

Optional

Optional match can be specified using ?. For example:

A: 'c'? B Num?;
B: 'b';

terminals

Tb: 'b';
Tc: 'c';
Num: /\d+/;

Here, we will recognize B which is optionally preceded with c and followed by Num.

Lets see what the parser will return optional inputs.

In this test:

#![allow(unused)]
fn main() {
#[test]
fn optional_1_1() {
    let result = Optional1Parser::new().parse("c b 1");
    output_cmp!(
        "src/sugar/optional/optional_1_1.ast",
        format!("{result:#?}")
    );
}
}

for input c b 1 the result will be:

Ok(
    A {
        tc_opt: Some(
            Tc,
        ),
        b: Tb,
        num_opt: Some(
            "1",
        ),
    },
)

If we leave the number out and try to parse c b, the parse will succeed and the result will be:

Ok(
    A {
        tc_opt: Some(
            Tc,
        ),
        b: Tb,
        num_opt: None,
    },
)

Notice that returned type is A struct with fields tc_opt and num_opt of Optional type. These types are auto-generated based on the grammar. To learn more see section on AST types/actions code generation.

Note

Syntax equivalence for optional operator

S: B?;

terminals
B: "b";

is equivalent to:

S: BOpt;
BOpt: B | EMPTY;

terminals
B: "b";

Behind the scenes Rustemo will create BOpt rule. All syntactic sugar additions operate by creating additional rules in the grammar during parser compilation.

One or more

One-or-more match is specified using + operator.

For example:

A: 'c' B+ Ta;
B: Num;

terminals

Ta: 'a';
Tc: 'c';
Num: /\d+/;

After c we expect to see one or more B (which will match a number) and at the end we expect a.

Let's see what the parser will return for input c 1 2 3 4 a:

#![allow(unused)]
fn main() {
#[test]
fn one_or_more_2_2() {
    let result = OneOrMore2Parser::new().parse("c 1 2 3 4 a");
    output_cmp!(
        "src/sugar/one_or_more/one_or_more_2_2.ast",
        format!("{result:#?}")
    );
}
}

The result will be:

Ok(
    [
        "1",
        "2",
        "3",
        "4",
    ],
)

Note

We see in the previous example that default AST building actions will drop string matches as fixed content is not interesting for analysis and usually represent syntax noise which is needed only for performing correct parsing. Also, we see that one-or-more will be transformed to a Vec of matched values (using the vec annotation, see bellow). Of, course, this is just the default. You can change it to fit your needs. To learn more see the section on builders.

Note

Syntax equivalence for one or more:

S: A+;

terminals
A: "a";

is equivalent to:

S: A1;
@vec
A1: A1 A | A;

terminals
A: "a";

Zero or more

Zero-or-more match is specified using * operator.

For example:

A: 'c' B* Ta;
B: Num;

terminals

Ta: 'a';
Tc: 'c';
Num: /\d+/;

This syntactic sugar is similar to + except that it doesn't require rule to match at least once. If there is no match, resulting sub-expression will be an empty list.

Let's see what the parser based on the given grammar will return for input c 1 2 3 a.

#![allow(unused)]
fn main() {
#[test]
fn zero_or_more_2_1() {
    let result = ZeroOrMore2Parser::new().parse("c 1 2 3 a");
    output_cmp!(
        "src/sugar/zero_or_more/zero_or_more_2_1.ast",
        format!("{result:#?}")
    );
}
}

The result will be:

Ok(
    Some(
        [
            "1",
            "2",
            "3",
        ],
    ),
)

But, contrary to one-or-more we may match zero times. For example, if input is c a we get:

Ok(
    None,
)

Note

Syntax equivalence for zero or more:

S: A*;

terminals
A: "a";

is equivalent to:

S: A0;
@vec
A0: A1 {nops} | EMPTY;
@vec
A1: A1 A | A;

terminals
A: "a";

So using of * creates both A0 and A1 rules. Action attached to A0 returns a list of matched a and empty list if no match is found. Please note the usage of nops. In case prefer_shift strategy is used, using nops will perform both REDUCE and SHIFT during GLR parsing if what follows zero or more might be another element in the sequence. This is most of the time what you need.

Repetition modifiers

Repetitions (+, *, ?) may optionally be followed by a modifier in square brackets. Currently, this modifier can only be used to define a separator. The separator is defined as a terminal rule reference.

For example, for this grammar:

A: 'c' B Num+[Comma];
B: 'b' | EMPTY;

terminals
Num: /\d+/;
Comma: ',';
Tb: 'b';
Tc: 'c';

We expect to see c, followed by optional B, followed by one or more numbers separated by a comma (Num+[Comma]).

If we give input c b 1, 2, 3, 4 to the parser:

#![allow(unused)]
fn main() {
#[test]
fn one_or_more_1_1_sep() {
    let result = OneOrMore1SepParser::new().parse("c b 1, 2, 3, 4");
    output_cmp!(
        "src/sugar/one_or_more/one_or_more_1_1_sep.ast",
        format!("{result:#?}")
    );
}
}

we get this output:

Ok(
    A {
        b: Some(
            Tb,
        ),
        num1: [
            "1",
            "2",
            "3",
            "4",
        ],
    },
)

Note

Syntax equivalence of one or more with separator :

S: A+[Comma];

terminals
A: "a";
Comma: ",";

is equivalent to:

S: A1Comma;
@vec
A1Comma: A1Comma Comma A | A;

terminals
A: "a";
Comma: ",";

Making the name of the separator rule a suffix of the additional rule name makes sure that only one additional rule will be added to the grammar for all instances of A+[Comma], i.e. same base rule with the same separator.

Parenthesized groups

Danger

This is not yet implemented.

You can use parenthesized groups at any place you can use a rule reference. For example:

S: a (b* a {left} | b);
terminals
a: "a";
b: "b";

Here, you can see that S will match a and then either b* a or b. You can also see that meta-data can be applied at a per-sequence level (in this case {left} applies to sequence b* a).

Here is a more complex example which uses repetitions, separators, assignments and nested groups.

S: (b c)*[comma];
S: (b c)*[comma] a=(a+ (b | c)*)+[comma];
terminals
a: "a";
b: "b";
c: "c";
comma: ",";

Syntax equivalence `parenthesized groups`:

    S: c (b* c {left} | b);
    terminals
    c: "c";
    b: "b";

is equivalent to:

    S: c S_g1;
    S_g1: b_0 c {left} | b;
    b_0: b_1 | EMPTY;
    b_1: b_1 b | b;
    terminals
    c: "c";
    b: "b";

So using parenthesized groups creates additional `_g<n>` rules (`S_g1` in the
example), where `n` is a unique number per rule starting from `1`. All other
syntactic sugar elements applied to groups behave as expected.

Greedy repetitions

Danger

This is not yet implemented.

*, +, and ? operators have their greedy counterparts. To make an repetition operator greedy add ! (e.g. *!, +!, and ?!). These versions will consume as much as possible before proceeding. You can think of the greedy repetitions as a way to disambiguate a class of ambiguities which arises due to a sequence of rules where earlier constituent can match an input of various length leaving the rest to the next rule to consume.

Consider this example:

S: "a"* "a"*;

It is easy to see that this grammar is ambiguous, as for the input:

a a

We have 3 solutions:

1:S[0->3]
a_0[0->1]
    a_1[0->1]
    a[0->1, "a"]
a_0[2->3]
    a_1[2->3]
    a[2->3, "a"]
2:S[0->3]
a_0[0->0]
a_0[0->3]
    a_1[0->3]
    a_1[0->1]
        a[0->1, "a"]
    a[2->3, "a"]
3:S[0->3]
a_0[0->3]
    a_1[0->3]
    a_1[0->1]
        a[0->1, "a"]
    a[2->3, "a"]
a_0[3->3]

If we apply greedy zero-or-more to the first element of the sequence:

S: "a"*! "a"*;

We have only one solution where all a tokens are consumed by the first part of the rule:

S[0->3]
a_0[0->3]
    a_1[0->3]
    a_1[0->1]
        a[0->1, "a"]
    a[2->3, "a"]
a_0[3->3]

`EMPTY` built-in rule

There is a special EMPTY rule you can reference in your grammars. EMPTY rule will reduce without consuming any input and will always succeed, i.e. it is empty recognition.

Named matches (assignments)

In the section on builders you can see that struct fields deduced from rules, as well as generated semantic actions parameters, are named based on the <name>=<rule reference> part of the grammar. We call these named matches or assignments.

Named matches enable giving a name to a rule reference directly in the grammar.

In the calculator example:

E: left=E '+' right=E {Add, 1, left}
 | left=E '-' right=E {Sub, 1, left}
 | left=E '*' right=E {Mul, 2, left}
 | left=E '/' right=E {Div, 2, left}
 | base=E '^' exp=E {Pow, 3, right}
 | '(' E ')' {Paren}
 | Num {Num};

terminals

Plus: '+';
Sub: '-';
Mul: '*';
Div: '/';
Pow: '^';
LParen: '(';
RParen: ')';
Num: /\d+(\.\d+)?/;

we can see usage of assignments to name recursive references to E in the first four alternatives as left and right since we are defining binary operations, while the fifth alternative for power operation uses more descriptive names base and exp.

Now, with this in place, generated type for E and two operations (/ and ^), and the semantic action for + operation will be:

#![allow(unused)]
fn main() {
#[derive(Debug, Clone)]
pub struct Div {
    pub left: Box<E>,
    pub right: Box<E>,
}
#[derive(Debug, Clone)]
pub struct Pow {
    pub base: Box<E>,
    pub exp: Box<E>,
}
#[derive(Debug, Clone)]
pub enum E {
    Add(Add),
    Sub(Sub),
    Mul(Mul),
    Div(Div),
    Pow(Pow),
    Paren(Box<E>),
    Num(Num),
}
pub fn e_add(_ctx: &Ctx, left: E, right: E) -> E {
    E::Add(Add {
        left: Box::new(left),
        right: Box::new(right),
    })
}
}

Note

This is just a snippet from the calculator example for the sake of brevity.

Notice the names of fields in Div and Pow structs. Also, the name of parameters in e_add action. They are derived from the assignments.

Without the usage of assignments, the same generated types and action would be:

#![allow(unused)]
fn main() {
#[derive(Debug, Clone)]
pub struct Div {
    pub e_1: Box<E>,
    pub e_3: Box<E>,
}
#[derive(Debug, Clone)]
pub struct Pow {
    pub e_1: Box<E>,
    pub e_3: Box<E>,
}
#[derive(Debug, Clone)]
pub enum E {
    Add(Add),
    Sub(Sub),
    Mul(Mul),
    Div(Div),
    Pow(Pow),
    Paren(Box<E>),
    Num(Num),
}
pub fn e_add(_ctx: &Ctx, e_1: E, e_3: E) -> E {
    E::Add(Add {
        e_1: Box::new(e_1),
        e_3: Box::new(e_3),
    })
}
}

Where these names are based on the name of the referenced rule and the position inside the production.

Rule/production meta-data

Rules and productions may specify additional meta-data that can be used to guide parser construction decisions. Meta-data is specified inside curly braces right after the name of the rule, if it is a rule-level meta-data, or after the production body, if it is a production-level meta-data. If a meta-data is applied to the grammar rule it is in effect for all production of the rule, but if the same meta-data is defined for the production it takes precedence.

Note

See the example bellow.

Currently, kinds of meta-data used during parser construction are as follows:

disambiguation rules
production kinds
user meta-data

Disambiguation rules

These are special meta-data that are used during by Rustemo during grammar compilation to influence decision on LR automata states' actions.

Note

See sections on parsing and resolving LR conflicts.

There are some difference on which rules can be specified on the production and terminal level.

Disambiguation rules are the following:

priority - written as an integer number. Default priority is 10. Priority defined on productions have influence on both reductions on that production and shifts of tokens from that production. Priority defined on terminals influence the priority during tokenization. When multiple tokens can be recognized on the current location, those that have higher priority will be favored.
associativity - right/left or shift/reduce. When there is a state where competing shift/reduce operations could be executed this meta-data will be used to disambiguate. These meta-data can be specified on both productions and terminals level. If during grammar analysis there is a state where associativity is defined on both production and terminal the terminal associativity takes precedence.

Note

Rustemo book

Grammar language

The structure of the grammar

Terminals

String recognizer

Regular expression recognizer

Usual patterns

One or more

Zero or more

Optional

Syntactic sugar - BNF extensions

Optional

One or more

Zero or more

Repetition modifiers

Parenthesized groups

Greedy repetitions

`EMPTY` built-in rule

Named matches (assignments)

Rule/production meta-data

Disambiguation rules

Production kinds

User meta-data

Example

Rule annotations

Grammar comments

Handling keywords in your language

Handling whitespaces and comments (a.k.a Layout) in your language