The rendered visual layout is designed to be spatially organized so that it makes sense perceptually. It's a bit like PDFs. I imagine the underlying hierarchy tree can be quite messy and spaghetti-like, so your best bet is to use the app in the form that the devs intended and tested it for.
I think screenshots are a really good and robust idea. It bothers the more structured-minded people, but apps are often not built so well. They are built up to the point where they look fine and people are able to use them. I'm pretty sure people who rely on accessibility systems have lots of complaints about this.
The progressives were pretty good at pushing accessibility in applications. It's not perfect, but every company I've worked with since the mid 2010s has made a big to-do about accessibility. For stuff on Linux you can instrument observability in a lot of different ways that are more efficient than screenshots, so I don't think screenshots are generally the right way to move forward, but they are universal and we already have capable vision models, so it's sort of a local optimization move.