(I agree that thread-per-request works just fine in the majority of cases, but it's still worthwhile to write about the cases where it doesn't work.)
Responding to your original post: you argue that async/await intends to solve the problem of data races. That's not why people use it, nor does it tackle that problem at all (you still need locks around shared data).
It only tries to solve the problem of highly concurrent servers, where requests are bound by some resource that request-handling threads have to wait on (typically I/O).
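To make that concrete, here is a minimal sketch of the kind of workload async targets: many requests that mostly wait on I/O, multiplexed on one thread. The handler and the `asyncio.sleep` stand-in for I/O are made up for illustration:

```python
import asyncio
import time

async def handle_request(i):
    # Stand-in for a blocking I/O wait (database call, upstream HTTP, ...).
    # While this coroutine is suspended, the single thread serves others.
    await asyncio.sleep(0.1)
    return i

async def main():
    start = time.monotonic()
    results = await asyncio.gather(*(handle_request(i) for i in range(100)))
    elapsed = time.monotonic() - start
    return results, elapsed

results, elapsed = asyncio.run(main())
# 100 "requests", each waiting 0.1 s, complete in roughly 0.1 s of wall time,
# because the waits overlap on one thread instead of occupying 100 threads.
```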
Coroutines/fibers are not an alternative to async servers, because they need primitives that are either baked into the language or the OS itself to work well.
Coroutines/fibers are completely orthogonal to async anything. The OP is arguing against poor man's coroutines, aka stackless coroutines, aka top-level-yield-only coroutines, which are significantly less expressive and composable than proper stackful coroutines (i.e. first-class one-shot continuations).
An alleged benefit of stackless coroutines is that yield points are explicit, so you know when your state can change. The OP is arguing that this is not really a benefit because it leads to fragile code. I happen to strongly agree.
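The fragility is easy to demonstrate: a check-then-act sequence is safe only as long as nobody inserts an await between the check and the act, and a later refactor can do exactly that. A made-up example (the `withdraw` function and the inserted yield point are hypothetical):

```python
import asyncio

balance = 100

async def withdraw(amount):
    global balance
    if balance >= amount:       # check
        # A later refactor inserts an await here (say, an audit-log write).
        # The "explicit yield points" guarantee silently evaporates:
        await asyncio.sleep(0)  # hypothetical added yield point
        balance -= amount       # act: another coroutine ran in between
        return True
    return False

async def main():
    global balance
    balance = 100
    await asyncio.gather(withdraw(100), withdraw(100))
    return balance

final = asyncio.run(main())
# Both withdrawals pass the check before either subtracts: balance goes negative.
```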
Green threads / coroutines / fibers are isomorphic with async keyword transparently implemented as a continuation passing style transform, which is how async callbacks usually work. Actual CPU-style stacks in a green thread scenario are nested closure activation records in an explicit continuation passing style scenario, and are implicit closure activation records (but look like stacks) when using an 'async' compiler-implemented CPS.
Properly composed awaits (where each function entered is entered via an await) build a linked list of activation records in the continuations as they drill down. This linked list is the same as the stack (i.e. serves the same purpose and contains the same data in slightly different layout) in a green threads scenario.
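You can observe that linked list directly in CPython: each suspended coroutine object points at the thing it is awaiting via its `cr_await` attribute, so drilling down through properly composed awaits produces a chain of activation records that serves the same purpose as a stack. A minimal sketch (the function names are invented; the bare-`yield` awaitable stands in for whatever real I/O awaits bottom out in):

```python
import types

@types.coroutine
def suspend():
    yield  # the innermost suspension point

async def inner():
    await suspend()

async def middle():
    await inner()

async def outer():
    await middle()

coro = outer()
coro.send(None)  # drive the coroutine until it first suspends

# Walk the chain of suspended frames: outer awaits middle awaits inner.
chain, node = [], coro
for _ in range(3):
    chain.append(node.__name__)
    node = node.cr_await

coro.close()
# chain == ['outer', 'middle', 'inner'] -- the "stack", as a linked list.
```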
What makes all these things different is how much they expose the underlying mechanics, and the metaphors they use in that exposition. But they're not orthogonal.
(If you meant 'async' as in async IO explicitly, rather than the async / await keyword with CPS transform as implemented in C#, Python, Javascript, etc., then apologies.)
As you said, you can of course recover stackful behaviour by using yield/await/async/whatever at every level of the call stack, but in addition to being a performance pitfall (you are in practice heap-allocating each frame separately, and yield is now O(N): your interpreter/compiler/JIT will need to work hard to remove the abstraction overhead), it leads to the green/red function problem.
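The green/red (sync/async) function problem in one sketch: a plain function cannot await, so it cannot call into async code without either becoming async itself (the colour spreads up the entire call stack) or standing up an event loop on the spot. The function names here are invented for illustration:

```python
import asyncio

async def fetch_user(uid):      # "red" (async) function
    await asyncio.sleep(0)      # stand-in for real I/O
    return {"id": uid}

def format_report(uid):         # "green" (sync) function
    # A plain function cannot await: calling fetch_user() here just
    # produces a coroutine object, not the result...
    coro = fetch_user(uid)
    assert asyncio.iscoroutine(coro)
    coro.close()
    # ...so the only ways out are to make format_report async too
    # (propagating the colour to every caller) or to block on a loop:
    return asyncio.run(fetch_user(uid))

report = format_report(42)
```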
Please correct me if I'm wrong, but doesn't asyncio in the form of async/await (or any other way of explicitly denoting context switches) solve the problem of data races, in that per-thread data structures can be operated on atomically by different coroutines? My understanding is that unless data structures are shared with another thread, you don't usually need locks around them.
async and threads are fundamentally different mechanisms: green threads (async) are scheduled by the runtime, while threads are scheduled by the OS.
In CPython, threads can (in theory) be switched at every bytecode instruction. Since calls into extensions / the interpreter are a single instruction, many data structure updates (like dict[a] = b or list.append) will appear atomic from Python.
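You can check this with the `dis` module: a dict store compiles to a single `STORE_SUBSCR` opcode, and since CPython only switches threads between bytecode instructions (with the GIL held during each one), the whole update runs without another thread interleaving. A small sketch (the `update` helper is made up):

```python
import dis

def update(d, k, v):
    d[k] = v  # the entire store is one STORE_SUBSCR opcode

ops = [ins.opname for ins in dis.get_instructions(update)]
# 'STORE_SUBSCR' appears in ops: one instruction, hence no thread
# switch can land in the middle of the dict update.
```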
That being said it is rather rare to have multiple threads run an event loop and process requests in Python. If threads and async are combined in the same process in Python, then it's usually only one event loop thread, and a thread pool for background activity. Usually these will be synchronized through async (eg. tornado.concurrent.run_on_executor) -- but that has nothing to do with context switches.
Edit: Reread your post. I may have slightly missed the point :)
Yes. Often one will find/design that there is no shared state, or that shared state is modified completely between yield points, so no locks between coroutines needed.
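A minimal sketch of that design rule: as long as a read-modify-write completes before the next await, the event loop cannot switch coroutines mid-update, so no lock is needed. (The counter and the `asyncio.sleep(0)` yield point are illustrative.)

```python
import asyncio

counter = 0

async def bump(n):
    global counter
    for _ in range(n):
        # Read-modify-write with no await in between: cooperative
        # scheduling guarantees no other coroutine runs here.
        counter += 1
        await asyncio.sleep(0)  # yield point *after* the update completes

async def main():
    global counter
    counter = 0
    await asyncio.gather(bump(1000), bump(1000))
    return counter

total = asyncio.run(main())
# Always exactly 2000 -- unlike preemptive threads, where this
# unlocked counter could lose updates.
```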