The Rust version probably could be made to work at an equivalent speed with enough effort. But at a high level, Go was much more enjoyable to work with. This is a side project and it has to be fun for me to work on it. The Rust version was actively un-fun for me, both because of all the workarounds that got in the way and because of the extremely slow compile times. Obviously you can tell from the nature of this project that I value fast build times :)
Was the Rust parser written by hand or did you use one of the parser frameworks (e.g. nom or pest) out there? nom, for instance, goes to great lengths to be zero-copy which would probably be a big benefit here.
Both the Rust and Go parsers were written by hand. They are also very similar (basically the Go version was a direct port of the Rust version) so the performance should be very comparable.
I assume by zero-copy you mean that identifiers in the AST are slices of the input file instead of copies? I was also careful to do this in both the Go and Rust versions. It's somewhat complicated because some JavaScript identifiers can technically have escape sequences (e.g. "\u0061bc" is the identifier "abc"), which require dynamic memory allocation anyway. See "allocatedNames" in the current parser for how this is handled.
Note that strings aren't slices of the input file because JavaScript strings are UTF-16, not UTF-8, and can have unpaired surrogates. So I represent string contents as arrays of 16-bit integers instead of 8-bit slices (in both Go and Rust).
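To make that concrete, here's roughly what this looks like on the Rust side (a simplified sketch, not the actual lexer; the real one handles many more cases):

    // Simplified sketch of decoding a JS string literal body into UTF-16
    // code units. A lone "\uD800" is a legal JS string, so Vec<u16> (not
    // &str, which must be valid UTF-8) is the natural representation.
    fn string_contents(literal: &str) -> Vec<u16> {
        let mut out: Vec<u16> = Vec::new();
        let mut chars = literal.chars().peekable();
        while let Some(c) = chars.next() {
            if c == '\\' && chars.peek() == Some(&'u') {
                chars.next(); // consume the 'u'
                let hex: String = chars.by_ref().take(4).collect();
                // Push the raw code unit, even an unpaired surrogate
                out.push(u16::from_str_radix(&hex, 16).unwrap());
            } else {
                // Ordinary characters encode to one or two code units
                let mut buf = [0u16; 2];
                out.extend_from_slice(c.encode_utf16(&mut buf));
            }
        }
        out
    }

    fn main() {
        // Unpaired surrogate: invalid as UTF-8, fine as 16-bit data
        assert_eq!(string_contents(r"a\uD800b"), vec![0x61, 0xD800, 0x62]);
    }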
In the past I tried using WTF-8 encoding (https://simonsapin.github.io/wtf-8/) for string contents, since it can represent string data as slices of the input file while still handling unpaired surrogates, but I ended up removing it because it complicated certain optimizations. I think the main issue was having to reason through weird edge cases such as constant folding of string addition when two unpaired surrogates are joined together. I think it's still possible to do this but I'm not sure how much of a win it is.
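Here's the edge case concretely (a standalone sketch, not code from the project):

    fn main() {
        // Two halves that are each unrepresentable in UTF-8 (and three
        // bytes each in WTF-8) join into one valid code point, U+1F600.
        let lead: Vec<u16> = vec![0xD83D];  // unpaired high surrogate
        let trail: Vec<u16> = vec![0xDE00]; // unpaired low surrogate
        assert!(String::from_utf16(&lead).is_err());
        assert!(String::from_utf16(&trail).is_err());

        // With plain 16-bit code units, folding the addition is just
        // concatenation; with WTF-8 the two 3-byte sequences would have
        // to be re-encoded as a single 4-byte sequence instead.
        let joined: Vec<u16> = lead.iter().chain(&trail).copied().collect();
        assert_eq!(String::from_utf16(&joined).unwrap(), "😀");
    }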
> They are also very similar (basically the Go version was a direct port of the Rust version) so the performance should be very comparable.
Sure, but different approaches are going to be optimal for different languages.
> I assume by zero-copy you mean that identifiers in the AST are slices of the input file instead of copies?
Yes. From the README:
> zero-copy: if a parser returns a subset of its input data, it will return a slice of that input, without copying
Geal (nom's author) also claims that nom is faster than hand-written C parsers.
> It's somewhat complicated because some JavaScript identifiers can technically have escape sequences (e.g. "\u0061bc" is the identifier "abc"), which require dynamic memory allocation anyway.
nom comes with 'escaped' and 'escaped_transform' combinators. In theory it should be possible, with relative ease, to return a slice when there are no escape characters and an allocated string when expansion is required. Presumably you'd have to use a Cow<str>, though.
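Hand-rolling it to illustrate the Cow<str> idea (this is not nom's actual combinators; decode_unicode_escapes is a made-up helper that assumes well-formed \uXXXX escapes):

    use std::borrow::Cow;

    // Borrow the input slice when no escapes occur; allocate only when
    // decoding is actually needed.
    fn identifier_text(raw: &str) -> Cow<'_, str> {
        if raw.contains('\\') {
            Cow::Owned(decode_unicode_escapes(raw))
        } else {
            Cow::Borrowed(raw) // zero-copy fast path
        }
    }

    fn decode_unicode_escapes(raw: &str) -> String {
        let mut out = String::with_capacity(raw.len());
        let mut chars = raw.chars();
        while let Some(c) = chars.next() {
            if c == '\\' {
                chars.next(); // skip the 'u'
                let hex: String = chars.by_ref().take(4).collect();
                let code = u32::from_str_radix(&hex, 16).unwrap();
                out.push(char::from_u32(code).unwrap());
            } else {
                out.push(c);
            }
        }
        out
    }

    fn main() {
        assert!(matches!(identifier_text("abc"), Cow::Borrowed("abc")));
        assert_eq!(identifier_text(r"\u0061bc"), "abc"); // decoded copy
    }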
> Note that strings aren't slices of the input file because JavaScript strings are UTF-16, not UTF-8, and can have unpaired surrogates. So I represent string contents as arrays of 16-bit integers instead of 8-bit slices (in both Go and Rust).
Of course they are. My opinion (which is worth what you paid for it) is that I'd just go for UTF-8 support. I can't remember the last time I saw UTF-16 in the wild (thankfully).
Performance-wise, the other thing I'd keep in mind with Rust is that string handling is painfully slow in debug builds.
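If that bites, Cargo's profile overrides can help, e.g. keeping rebuilds of your own crate fast while still optimizing dependencies (standard Cargo.toml syntax, nothing project-specific):

    # Cargo.toml
    [profile.dev]
    opt-level = 1              # light optimization for your own code

    [profile.dev.package."*"]
    opt-level = 3              # fully optimize all dependencies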