Most parsers don't actually work with "lines" as a unit, those are for user-form...

Timon3 · on July 25, 2024

Thanks for the response, but I'm aware of the basics. My question is pointed towards making language parsers resilient towards separately-evolving standards. How would you build a JS parser so that it correctly parses any new TS syntax, without changing behavior of valid code?

The example snippet I added is designed to violate the rules I could come up with. I'd specifically like to know: what are better rules to solve this specific case?

thanksgiving · on July 26, 2024

> How would you build a JS parser so that it correctly parses any new TS syntax, without changing behavior of valid code?

I don't know much about parsers beyond what I learned in a one-semester introductory class in college, but from what I understand of your question, I think the answer is that you can't, simply because we can't look into the future.

WorldMaker · on July 26, 2024

In your specific case:

1. Automatic semicolon insertion would next want to kick in at the } token, so that's the obvious end of the statement. If you've asked it to ignore from `as` to the end of the statement (as you've established with your "ignore to the end of the 'line'"), that's where it stops ignoring.

1A. Obviously in that case `bar(null` is not a valid statement after ignoring from `as` to the end of the statement.

2. The trick to your specific case, which you've stumbled into, is that `as` is an expression modifier, not a statement modifier. The argument to a function is an expression, not a statement. That definitely complicates things because "end of the current expression" is often a lot more complicated than ASI (and people think ASI is complicated). Most parsers are going to have some sort of token state counter for nested parentheses. (This is a fun implementation detail of different parsers: while recursion is easy enough in "context-free grammars", the details of tracking that recursion are generally not technically "context-free" at that point, so sometimes it is in the tokenizer, sometimes it is a context extension to the parser itself, and sometimes it uses a stack implementation detail of the parser.) You are going to want to ignore everything up to the next "," token that signals a new argument or the next ")" that signals the end of the arguments, while respecting any () nesting.

2A. Because of how complicated expression parsing can get, that probably sets some resiliency bounds on your "ignorable grammar": it may require that internally it still follows most of the logic of your general expression language: balanced nested parentheses, no dangling commas, usual comment syntax, etc.

2B. You probably want to define those sorts of boundaries anyway. The easiest way is to say that ignorable extensions such as `as` must themselves parse as if they were valid expressions, even if the language cannot interpret their meaning. You can think of this as a meta-grammar where one option for an expression might be `<expression> ::= <expression> 'as' <expression>`, with the second expression being parseable but, once parsed, ignorable by the language runtime and JIT. You can see that effectively in the syntax description for Python's original PEP 3107 syntax-only type hints standard [1]; it's surprisingly succinct there. (The possible proposed grammar in the Type Annotations proposal to TC39 is a lot more specific and a lot less succinct [2], for a number of reasons.)
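To make point 2 concrete, here is a minimal sketch of the "ignore from `as` to the end of the current expression" rule using a nesting counter. Everything in it (`tokenize`, `stripAs`, the token shape) is hypothetical and made up for illustration, not taken from any real parser. It assumes every `as` token is the keyword, it only balances (), [] and {}, and it does not handle string literals, comments, or generic arguments such as `Map<string, number>`.

```typescript
// Hypothetical sketch: tokens are plain strings, nothing here is a real parser API.
type Token = string;

// Very rough tokenizer: identifiers, integers, or single punctuation characters.
function tokenize(src: string): Token[] {
  return src.match(/[A-Za-z_$][\w$]*|\d+|[^\s\w]/g) ?? [];
}

const OPEN = new Set(["(", "[", "{"]);
const CLOSE = new Set([")", "]", "}"]);

// Drop every `as <ignored>` span. The ignored span ends at the first ",", ";"
// or closing bracket seen while the nesting depth (counted from `as`) is 0.
function stripAs(tokens: Token[]): Token[] {
  const out: Token[] = [];
  let i = 0;
  while (i < tokens.length) {
    if (tokens[i] !== "as") {
      out.push(tokens[i]);
      i++;
      continue;
    }
    i++; // skip the `as` token itself
    let depth = 0;
    while (i < tokens.length) {
      const t = tokens[i];
      // A boundary only counts when we are not inside brackets opened after `as`.
      if (depth === 0 && (t === "," || t === ";" || CLOSE.has(t))) break;
      if (OPEN.has(t)) depth++;
      else if (CLOSE.has(t)) depth--;
      i++;
    }
    // The boundary token is left in place and copied by the outer loop.
  }
  return out;
}

console.log(stripAs(tokenize("bar(null as (string | number)[], 2)")).join(" "));
// prints "bar ( null , 2 )"
```

Note that the comma inside `{ a: string, b: number }` would not end the ignored span, because the depth is 1 at that point. That is the kind of bound 2A is talking about: the ignored part still has to have balanced brackets for this to work at all.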

[1] https://peps.python.org/pep-3107/

[2] https://tc39.es/proposal-type-annotations/grammar.html