"Basically seeing if they can code" is the goal, yes.

> I don't think how far a candidate gets on this particular test determines if they're OK or if they're great.

It doesn't with certainty, but measurement of this kind is always subject to some degree of error. The point of interviewing (especially at the early stage where this interview is applied) is signal, not certainty, and we do seem to be picking up on signal:

-- # of steps completed on the coding task is the strongest single signal of success at later final interview rounds. Candidates who have gotten offers (not from us, to be clear, so this is uncorrelated error) average ~1.1 more steps completed on our coding problem than candidates who don't. That's after we filtered a lot of the weaker results out, so it's sampling biased in slower candidates' favor. This is a larger gap between offers and non-offers than any of the other nineteen individual internal scores we record for each interview, iirc.

-- # of steps completed correlates with results on the rest of the interview in a way that is aligned with reasonable psychometric measures of quality, same as most of the other parts of the interview (see e.g. [1]).

If your point here is just that interviews involve somewhat artificial work and are prone to error - well, yes. Everyone involved in testing as a field already knows that and is working around it.

To use your concrete example, creating a problem involving "sussing out all the ambiguities and contradictions" for an hour would be asking a lot more work from candidates. If we did that, I bet we'd have people complaining about the lengthy workload of doing an interview (or simply not doing it, which would be fatal to us as a business). I would also worry that that's a much fuzzier skill to measure, particularly in a cross-organizational way. It's not that, in isolation, you could never create an interview to test ambiguity-resolution, it's that (at least in my judgment), it would be impractical for us to do so given what we are trying to do.

And finally, to this:

> I posit that a great one could finish 2-3 steps and an OK one could finish 4-5.

See [2].

-------

[1] https://bsky.app/profile/otherbranch.bsky.social/post/3lpcdl...

[2] https://news.ycombinator.com/item?id=43006330