Lua in Rust: Lua lexemes
Now that `parser_lib` is ready, it’s time to actually use it to parse Lua.
Lua lexemes
I usually parse from a string directly into a syntax tree. But this time I’ll do something different for a change: I’ll first parse a string into a vector of tokens.
Following the Lua Lexical Conventions I defined the Lua lexemes here.
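For a rough picture, the lexeme type could look something like this. This is a sketch with hypothetical names and variants, not the repo’s actual definitions:

```rust
// Hypothetical sketch of the lexeme types; the real lua_lexemes
// module may differ in names and details.
pub enum Keyword { And, Else, Elseif /* ... */ }
pub enum OtherToken { Plus, Minus, Eq /* ... */ }

pub enum Token<'a> {
    Keyword(Keyword),
    OtherToken(OtherToken),
    Identifier(&'a str),
    StringLiteral(String),
    NumberLiteral(f64),
}
```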
Everything is pretty straightforward apart from the `Keyword` and `OtherToken` definitions. Probably the simplest way of parsing them is to enumerate all of them and try one after the other. So, I need a way to enumerate all the elements of an enum and also convert them into strings. To do that I resorted to a declarative macro.
My macro matches the pattern `$p:vis $n:ident { $($v:ident => $s:literal),* $(,)?}`. First is an optional visibility modifier that’ll be bound to `$p`, next is an identifier bound to `$n`, then `{`, then a (possibly empty) list of comma-separated pairs, optionally terminated by a comma, and finally `}`. Each pair is an identifier `$v` followed by `=>` and then a literal `$s`. What I wanted to express by all this is: this is an enum with visibility `$p`, named `$n`, with items `$v` that map to strings `$s`. There’s a bunch of code generated from this macro:
- Generate the enum itself.
$p enum $n {
    $($v),*
}
- Count items. A hacky way: for each item `$v` generate a pair `($v, 1)`, extract the second element (i.e. `1`) and sum them all. Note that the function is marked `const`, so it should (must?) be evaluated at compile time, producing an actual constant instead of a giant `1 + 1 + ... + 0` expression.
pub const fn items_count() -> usize {
    $(($n::$v, 1).1 + )* 0
}
- Enumerate all the items.
pub const ITEMS: [$n; $n::items_count()] = [
    $($n::$v),*
];
- Convert items to strings.
pub fn to_str(&self) -> &'static str {
    match self {
        $($n::$v => $s),*
    }
}
Now that I look back at it, I think it’s overkill. What I really need is to enumerate all the items; everything else can very well be done outside the macro.
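For reference, here’s how the fragments above fit together into one macro. This is a reconstruction assembled from the snippets in this post, so the actual `plain_enum` in the repo may differ in details (and, per the end of this post, was later simplified):

```rust
macro_rules! plain_enum {
    ($p:vis $n:ident { $($v:ident => $s:literal),* $(,)? }) => {
        $p enum $n {
            $($v),*
        }

        impl $n {
            // Compile-time item count via the (item, 1).1 trick.
            pub const fn items_count() -> usize {
                $(($n::$v, 1).1 + )* 0
            }

            // All items, enumerated in declaration order.
            pub const ITEMS: [$n; $n::items_count()] = [
                $($n::$v),*
            ];

            // Map each item to its string form.
            pub fn to_str(&self) -> &'static str {
                match self {
                    $($n::$v => $s),*
                }
            }
        }
    };
}

// Example invocation with a few of Lua's keywords:
plain_enum!(pub Keyword {
    Else => "else",
    Elseif => "elseif",
    End => "end",
});
```

With this, `Keyword::ITEMS` enumerates all three variants and `Keyword::Elseif.to_str()` returns `"elseif"`.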
Parsing Lua
And so here’s a lexer for Lua.
- `keyword_lexer`. Pretty straightforward parsing: we iterate over all keywords and pick the first one that matches. There’s one problem though: we have the keywords `else` and `elseif`, and if you match against `else` first, it’ll succeed. To tackle this I first sort (`create_sorted_items`) all the keywords by length, the longest first. This way `elseif` is tried before `else` (see the sketch after this list).
- `other_token_lexer` is implemented similarly.
- `string_literal_lexer`. A borderline unreadable beast, though simple in essence: read an opening quote and then repeatedly read a character (except a closing quote) or an escape symbol; finally, read a closing quote.
- `long_brackets_lexer`. Parse the opening long bracket and then parse everything until you reach the closing long bracket.
- `number_literal_lexer`. Parse the number as hexadecimal or as decimal with a floating point. Notably, my parser only recognises valid numbers; the actual turning into numbers is done by the Rust standard library.
- `identifier_lexer`. Essentially `[a-zA-Z_][a-zA-Z_0-9]*`.
- `token_lexer`. Tries each of the lexers above. It’s important to try `keyword_lexer` before `identifier_lexer`, since the latter will match any keyword.
- `comment_lexer`. Parse the beginning of the comment, `--`, and then either parse a long bracket, or everything until the end of the line.
- `whitespace_lexer`. Repeatedly parse whitespace.
- `tokens_lexer`. Parses a possible comment or whitespace, and then repeatedly uses `token_lexer`, each followed by a comment or whitespace.
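To illustrate the longest-first trick, here’s a small self-contained sketch. It uses plain string matching rather than the post’s parser combinators, and `match_keyword` is a made-up helper, not a function from the repo:

```rust
// Self-contained sketch of the longest-first trick; without the sort,
// "elseif" would lex as the shorter keyword "else" with "if" left over.
fn match_keyword(input: &str, keywords: &[&'static str]) -> Option<&'static str> {
    // Mirrors the idea behind create_sorted_items: longest keywords first.
    let mut sorted = keywords.to_vec();
    sorted.sort_by_key(|kw| std::cmp::Reverse(kw.len()));
    sorted.into_iter().find(|kw| input.starts_with(*kw))
}

fn main() {
    let keywords = ["else", "elseif"];
    assert_eq!(match_keyword("elseif x then", &keywords), Some("elseif"));
    assert_eq!(match_keyword("else return end", &keywords), Some("else"));
}
```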
A finishing touch
While writing this post I noticed a mistake that I made: `tokens_lexer` is happy to accept the following code:
local x;
x = "';
It produced the stream `local x ; x =`. It correctly identified `"';` as an unfinished string and stopped parsing right there. But it didn’t fail completely, because having unconsumed input is not an error. Well, in the case of `tokens_lexer` it is, so I added
pub fn eof<'a, 'b: 'a>() -> Box<dyn Parser<'b, ()> + 'a> {
    Box::new(move |s| {
        // Succeed only when the entire input has been consumed.
        if s.index == s.input.len() {
            Some(((), s))
        } else {
            None
        }
    })
}
to the `parser_lib` and used it in `tokens_lexer`. Now, when all the tokens have been parsed, we check whether we’re at the end of the input stream, and if we’re not, we fail.
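For intuition, here’s a toy, self-contained rendition of that check. The `index`/`input` state layout mirrors the snippet above, while `State` and the example inputs are made up for illustration:

```rust
// Toy illustration: a lexer that stops early leaves unconsumed input,
// and only an explicit eof check turns that into a failure.
#[derive(Clone, Copy)]
struct State<'a> {
    input: &'a str,
    index: usize,
}

fn eof<'a>(s: State<'a>) -> Option<((), State<'a>)> {
    if s.index == s.input.len() {
        Some(((), s))
    } else {
        None
    }
}

fn main() {
    // All input consumed: eof succeeds.
    let done = State { input: "local x ;", index: 9 };
    assert!(eof(done).is_some());

    // Lexing stopped at the unfinished string literal: eof fails.
    let stuck = State { input: "local x ; x = \"';", index: 14 };
    assert!(eof(stuck).is_none());
}
```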
Also, I moved `lua_syntax` to `lua_lexemes` and `lua_parser` to `lua_lexer`, because I’ll need the names `syntax` and `parser` when I actually parse a stream of lexemes into a syntax tree.
And finally, I simplified the `plain_enum` macro to only generate the enumeration bit.
The code is here.