Now that parser_lib is ready, it’s time to actually use it to parse Lua.

Lua lexemes

I usually parse from a string directly into a syntax tree. But this time, for a change, I'll do something different: I'll first parse the string into a vector of tokens.

Following Lua's Lexical Conventions, I defined the Lua lexemes here.

Everything is pretty straightforward apart from the Keyword and OtherToken definitions. Probably the simplest way of parsing them is to enumerate all of them and try one after the other. So I need a way to enumerate all the elements of an enum and also to convert them into strings. To do that I resorted to a declarative macro.

My macro matches the pattern $p:vis $n:ident { $($v:ident => $s:literal),* $(,)?}. First comes an optional visibility modifier that'll be bound to $p, next an identifier bound to $n, then {, then a (possibly empty) list of comma-separated pairs, optionally terminated by a comma, and finally }. Each pair is an identifier $v followed by => and then a literal $s. What I wanted to express by all this is: this is an enum with visibility $p, named $n, with items $v that map to strings $s. The macro generates a bunch of code (assembled into one full sketch after the list):

  • Generate the enum itself.
$p enum $n {
    $($v),*
}
  • Count the items. A hacky way: for each item $v generate a pair ($n::$v, 1), extract the second element (i.e. 1) and sum them all up. Note that the function is marked const, so when it's used in a const context (like the length of ITEMS below) it must be evaluated at compile time, producing an actual constant instead of a giant 1 + 1 + ... + 0 expression.
pub const fn items_count() -> usize {
    $(($n::$v, 1).1 + )* 0
}
  • Enumerate all the items.
pub const ITEMS: [$n; $n::items_count()] = [
    $($n::$v),*
];
  • Convert items to string.
pub fn to_str(&self) -> &'static str {
    match self {
      $($n::$v => $s),*
    }
}
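
Putting the fragments together, the whole plain_enum macro might look roughly like this. This is my sketch of the assembly, not a verbatim copy of the real macro (which lives in the linked code); the derives, in particular, are my addition for convenience:

macro_rules! plain_enum {
    ($p:vis $n:ident { $($v:ident => $s:literal),* $(,)? }) => {
        // NOTE: the derives are an assumption of this sketch, not part of
        // the fragments above; Clone is handy for ITEMS.to_vec() and friends.
        #[derive(Debug, Clone, Copy, PartialEq, Eq)]
        $p enum $n {
            $($v),*
        }

        impl $n {
            // Count the items: one `1` per variant, summed at compile time.
            pub const fn items_count() -> usize {
                $(($n::$v, 1).1 + )* 0
            }

            // Enumerate all the items.
            pub const ITEMS: [$n; $n::items_count()] = [
                $($n::$v),*
            ];

            // Convert items to strings.
            pub fn to_str(&self) -> &'static str {
                match self {
                    $($n::$v => $s),*
                }
            }
        }
    };
}

Invoking it with a shortened Keyword (the real one lists all Lua keywords) could look like:

plain_enum! {
    pub Keyword {
        And => "and",
        Else => "else",
        Elseif => "elseif",
        If => "if",
    }
}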

Now that I look back at it, I think it's overkill. What I really need is to enumerate all the items; everything else can just as well be done outside the macro.

Parsing Lua

And so here’s a lexer for Lua.

  • keyword_lexer. Pretty straightforward parsing: we iterate over all keywords and pick the first that matches. There's one problem though: we have the keywords else and elseif, and if you try else first against elseif, it'll match, leaving if unconsumed. To tackle this, I first sort (create_sorted_items) all the keywords by length, longest first. This way elseif is tried before else (there's a small standalone sketch of this after the list).
  • other_token_lexer is implemented similarly.
  • string_literal_lexer. A borderline unreadable beast, though simple in essence: read an opening quote, then repeatedly read either a character (anything except the closing quote) or an escape sequence, and finally read the closing quote (also sketched after the list).
  • long_brackets_lexer. Parse the opening long bracket and then parse everything until you reach the closing long bracket.
  • number_literal_lexer. Parse the number as either hexadecimal or decimal floating point. Notably, my parser only recognises valid numbers; the actual conversion into numbers is delegated to the Rust standard library (see the last sketch after the list).
  • identifier_lexer. Essentially a [a-zA-Z_][a-zA-Z_0-9]*.
  • token_lexer. Tries each of the lexers above. It's important to try keyword_lexer before identifier_lexer, since the latter would also match any keyword.
  • comment_lexer. Parse the comment opener -- and then either a long bracket or everything until the end of the line.
  • whitespace_lexer. Repeatedly parse whitespace.
  • tokens_lexer. Parses an optional leading comment or whitespace, and then repeatedly applies token_lexer, each token followed by an optional comment or whitespace.
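
To make the longest-first trick concrete, here's a minimal parser_lib-free sketch of it. The helper names create_sorted_keywords and match_keyword are mine (the post's actual helper is create_sorted_items), and Keyword is the shortened enum from the macro example above:

fn create_sorted_keywords() -> Vec<Keyword> {
    let mut items = Keyword::ITEMS.to_vec();
    // Longest first, so that "elseif" is tried before "else".
    items.sort_by_key(|k| std::cmp::Reverse(k.to_str().len()));
    items
}

fn match_keyword(input: &str) -> Option<(Keyword, &str)> {
    // Try the keywords in order and return the first that is a prefix of
    // the input, together with the rest of the input.
    create_sorted_keywords()
        .into_iter()
        .find_map(|k| input.strip_prefix(k.to_str()).map(|rest| (k, rest)))
}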
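
The string literal logic, stripped of parser_lib noise, boils down to the loop below. This is an illustrative sketch handling only a tiny subset of Lua's escape sequences; scan_string_literal is a hypothetical name:

fn scan_string_literal(input: &str) -> Option<(String, &str)> {
    let mut chars = input.char_indices();
    // Read the opening quote, either single or double.
    let (_, quote) = chars.next()?;
    if quote != '"' && quote != '\'' {
        return None;
    }
    let mut out = String::new();
    while let Some((i, c)) = chars.next() {
        match c {
            // The closing quote ends the literal; return the rest of the input.
            c if c == quote => return Some((out, &input[i + c.len_utf8()..])),
            // An escape symbol: consume and translate the next character.
            '\\' => {
                let (_, e) = chars.next()?;
                out.push(match e {
                    'n' => '\n',
                    't' => '\t',
                    other => other,
                });
            }
            _ => out.push(c),
        }
    }
    // Ran out of input before the closing quote: an unfinished string.
    None
}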
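
And this is the kind of delegation to the standard library I mean for number literals, assuming the lexer has already recognised the literal's span (convert_number is a hypothetical name):

fn convert_number(text: &str) -> Option<f64> {
    if let Some(hex) = text.strip_prefix("0x").or_else(|| text.strip_prefix("0X")) {
        // Hexadecimal literals are parsed as integers and widened to f64.
        i64::from_str_radix(hex, 16).ok().map(|n| n as f64)
    } else {
        // Decimal literals, possibly with a fraction and an exponent.
        text.parse::<f64>().ok()
    }
}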

A finishing touch

While writing this post I noticed a mistake that I made: tokens_lexer is happy to accept the following code:

local x;
x = "';

It produced the stream: local x ; x =. It correctly identified "'; as an unfinished string and stopped parsing right there. But it didn't fail as a whole, because having unconsumed input is not an error. Well, in the case of tokens_lexer it is, so I added

pub fn eof<'a, 'b: 'a>() -> Box<dyn Parser<'b, ()> + 'a> {
    Box::new(move |s| {
        // Succeed, consuming nothing, only when the entire input is consumed.
        if s.index == s.input.len() {
            Some(((), s))
        } else {
            None
        }
    })
}

to the parser_lib and used it in tokens_lexer. Now, once all the tokens are parsed, we check whether we've reached the end of the input stream, and if we haven't, we fail.

Also, I moved lua_syntax to lua_lexemes and lua_parser to lua_lexer, because I'll need the names syntax and parser when I actually parse a stream of lexemes into a syntax tree.

And finally, I simplified the plain_enum macro to only generate the enumeration bit.

The code is here.

Outline

  1. Lua in Rust: Introduction
  2. Lua in Rust: Combinatory parsing
  3. Lua in Rust: Combinatory parsing (cont.)
  4. (This one) Lua in Rust: Lua lexemes
  5. Lua in Rust: More parsing
  6. Lua in Rust: Left recursion