While this is likely true for this specific prompt, I think that cherry-picking a single prompt where DALL-E outperforms SD is not super indicative of anything. Conversely, I've found a large number of prompts where SD outperforms DALL-E, either in aesthetic quality or just in following directions! I think you'd really have to compare both of them across a large number of prompts of different types to be sure.
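For what it's worth, here is a rough sketch of what such a comparison could look like: collect a judgment per prompt, group by prompt type, and look at win rates per group rather than a single example. The function name `win_rates`, the category labels, and the judgments in the usage note are all made up for illustration; they are placeholders, not real results.

```python
from collections import defaultdict

MODELS = ("SD", "DALL-E", "tie")

def win_rates(judgments):
    """Turn (category, winner) pairs into per-category win fractions.

    `judgments` is an iterable of (category, winner) tuples, where
    winner is one of "SD", "DALL-E", or "tie". How the winner is
    decided (human rating, blind A/B, etc.) is left to the caller.
    """
    counts = defaultdict(lambda: {m: 0 for m in MODELS})
    for category, winner in judgments:
        counts[category][winner] += 1

    rates = {}
    for category, tally in counts.items():
        total = sum(tally.values())
        rates[category] = {m: n / total for m, n in tally.items()}
    return rates
```

You'd call it like `win_rates([("pixel art", "SD"), ("photo", "DALL-E"), ...])` with hypothetical judgments, then see whether one model wins consistently across categories or only in some of them.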
To say nothing of the fact that you have lots of sliders to configure just how closely or loosely it follows your prompt, plus a choice of sampling methods.
You can't just compare SD and DALL-E performance on prompts alone, because SD gives you a lot more levers to steer it in the direction you want.
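To make one of those levers concrete: as far as I understand it, the "how closely it follows the prompt" slider in SD front ends is the classifier-free guidance scale. At each denoising step the model makes one noise prediction with the prompt and one without, and the scale controls how far the result is pushed toward the prompted one. A minimal sketch of just that arithmetic, using plain lists in place of real tensors (the function name `apply_guidance` and the toy numbers are mine, not from any library):

```python
def apply_guidance(uncond_pred, cond_pred, scale):
    """Combine two noise predictions with classifier-free guidance.

    uncond_pred: prediction made without the prompt
    cond_pred:   prediction made with the prompt
    scale:       0 ignores the prompt, 1 uses the prompted prediction
                 as-is, larger values push further toward the prompt
    """
    return [u + scale * (c - u) for u, c in zip(uncond_pred, cond_pred)]


# Toy values standing in for real model outputs.
uncond = [0.0, 1.0]
cond = [1.0, 3.0]

print(apply_guidance(uncond, cond, 0.0))  # same as uncond
print(apply_guidance(uncond, cond, 1.0))  # same as cond
print(apply_guidance(uncond, cond, 7.5))  # extrapolated past cond
```

The choice of sampler is a separate lever on top of this, so two people running the "same prompt" in SD can be making quite different comparisons.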
> house interior, friendly, playful, video game, screenshot, mockup, birds-eye view, top down perspective, jrpg, 32 bit, pixel art, black background
SD absolutely demolishes DALL-E on this one. SD produces really nice-looking output, with a high degree of consistency. DALL-E produces incoherent nonsense.