Quickly looking at the source code, mostly treeBuilder and tokenizer, I do see several possible improvements:
- Use Typescript instead of JavaScript
- Use perfect hashes instead of ["a', "b", "c"].includes() idioms, string equalities, Seys, etc.
- Use a single perfect hash to match all tags/attribute names and then use enums in the rest of the codebase
- Use a single if (token.kind === Tag.START instead of repeating that for 10 consecutive conditionals
- Don't return the "reprocess" constant, but use an enum or perhaps nothing if "reprocess" is the only option
- Try tail recursion instead of a switch over the state in the tokenizer
- Use switches (best after a perfect hash lookup) instead of multiple ifs on characters in the tokenizer
- "treeBuilder.openElements = treeBuilder.open_elements;" can't possibly be good code
Perhaps the agent can find these themselves if told to make the code perfect and not just pass tests
I didn't include the TypeScript bit though - it didn't use TypeScript because I don't like adding a build step to my JavaScript projects if I can possible avoid it. The agent would happily have used TypeScript if I had let it.
I don't like that openElements = open_elements pattern either - it did that because I asked it for a port of a Python library and it decided to support the naming conventions for both Python and JavaScript at once. I told it to remove all of those.
It pushed back against the tail recursion suggestion:
> The current implementation uses a switch statement in step(). JavaScript doesn’t have proper tail call optimization (only Safari implements it), so true tail recursion would cause stack overflow on large documents.
Perhaps the agent can find these themselves if told to make the code perfect and not just pass tests