I just measured this with the np.ascontiguousarray call (inside the loop of cour...

I just measured this with the np.ascontiguousarray call (inside the loop of course, if you do it outside the loop when i2 is assigned to, then ofc you don't measure the overhead of np.ascontiguousarray) and the runtime is about the same as without this call, which I would expect given everything TFA explains. So you still have the 100x slowdown.

TFA links to a GitHub repo with code resizing non-zero images, which you could use both to easily benchmark your suggested change, and to check whether the output is still correct.