I'll go against the grain here and say that async/await, whether implemented via thread-per-core like here or via stackless coroutines, is not the solution.
Async/await will make complexity explode because of the colored function problem [1].
The solution to expensive context switches is cheap context switches, plain and simple. User-mode lightweight threads like Go's, or Java's upcoming ones with Loom [2], have proven that this is possible.
Yes, it does mean that it can only happen in a language that controls its stack (so that you can slice it off and pop a continuation on it). I sincerely believe this is Rust's ballpark; hell they even started the project with that idea in mind.
The "Colored function problem" is a complete fallacy: aside from the fact that async and non-async fns are interoperable in Rust (block_on) and other languages, the same argument could be made against functions with any preconditions.
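For illustration, a minimal sketch of that interoperability using the futures crate's block_on (the function names here are made up):

```rust
use futures::executor::block_on;

// A made-up async fn standing in for any async API.
async fn compute() -> u32 {
    21 * 2
}

fn main() {
    // Plain synchronous code drives the async fn to completion.
    let answer = block_on(compute());
    assert_eq!(answer, 42);
}
```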
> The solution to expensive context switches is cheap context switches, plain and simple.
Except the performance difference between a kernel-mode context switch and a user-mode one is only going to narrow in the future. The overhead that cannot be eliminated from context switches is their effect on the cache, since you start to run into the laws of physics at that point...
The real solution to expensive context switches is to just do fewer of them... No context switch is always faster than a "fast" context switch.
> I sincerely believe this is Rust's ballpark
I think it's plausible that Rust could get a library-level solution for fibers that does not rely on unstable details of the compiler. Rust will never again have that baked into the language, as it would make the language completely unsuitable for many low-level tasks.
Fibers, especially the way they are implemented in Go, come with a lot of their own complexity.
This is just one of the segfaults caused by Go's complex stack control. I don't want to rely on a runtime that contains these sorts of bugs, and the best way to avoid that is to avoid having a runtime in the first place.
Blocking an OS thread as a means of being compatible is not exactly what we're trying to do here.
> Except the performance difference between a kernel-mode context switch and a user-mode one is only going to narrow in the future
OS overhead can be minimized, but program stacks are a function of the language's design. And if you're not right-sizing preemption points in your stack, you'll be switching large parts of it. This means you _must_ have stackful coroutines if you want to keep switching threads.
> The real solution to expensive context switches is to just do fewer of them... No context switch is always faster than a "fast" context switch.
Sure, but writing the perfect assembly and using gotos has always been the fastest. Abstraction has a cost, and some runtimes/languages are currently proving that they can reduce this cost to zero under the current conditions, where IO is much costlier than a few hundred nanoseconds. We just happen to be at a time when the compiler is starting to be smarter than the user. But I guess the benchmarks will settle all this.
So that's a bug in the Go compiler. They can either fix it or pay "a small speed penalty (nanoseconds)" as a workaround, which the author qualifies as "acceptable".
Yes, that's not the absolute performance possible. But why care about that? At some point it all comes down to TCO (except for latencies in HFT); and TCO tells you that it's ok. Development complexity and maintainability matters. Especially when you can max out your IO usage for the decades to come.
> Blocking an OS thread as a mean to be compatible is not exactly what we're trying to do here.
It's what Go does whenever you call into a C function or make a system call. As long as those blocking functions are not the bottleneck then it works fine.
My problem with the "colored function" analogy is that it implies that the problem is somehow due to the surface syntax, when in reality the problem still exists in all languages that support procedural IO: some of those languages like to just pretend that the problem doesn't exist.
The only language I'm aware of which truly solves that is Haskell, since all IO happens via a monad.
> Yes, that's not the absolute performance possible. But why care about that?
This point was not about performance. It's about the pitfalls of writing all of your code on top of a complex and buggy runtime.
Programming is a lot simpler, and development is a lot faster, when I don't have to worry about that.
There are also several things that are a lot more complicated when you bake a complex runtime into the language like Go does. Thread-local storage is completely broken for one. If you do any kind of GUI programming, you may need to use `runtime.LockOSThread` as most GUIs expect function calls from a single thread. etc. etc.
I don't know too much about Rust, but in .NET, blocking on Task.Result is considered an anti-pattern and a Bad Thing To Do. Not least because it very easily leads to deadlocks.
This is considered bad in .NET because callbacks can "Post" back to the original SynchronizationContext, which depends on what async/await is being used for. For example, an await on a WPF UI thread will join back to the calling thread, so if you call Task.Result without configuring the task not to join back, you can deadlock the callback processing queue. To avoid this you would use ConfigureAwait(false), depending on your situation. It's the source of a lot of confusion in .NET. I don't believe that Rust has this "feature", and if you somehow wrote code to achieve the same thing as .NET does, it probably wouldn't compile in Rust due to the ownership rules.
Wait, Rust doesn't associate executors with futures? If it does, and there is a single-threaded executor available, then it's absolutely possible to deadlock just like in .NET.
That is a good question. It is possible to avoid deadlocks in a single threaded executor with higher priority interrupts but I’m no authority in this area. Maybe someone else can comment. Most of my understanding in this area comes from reading this article: https://os.phil-opp.com/async-await/
In Rust a task which is started on a given executor never leaves it. It cannot switch executors like it can in C#, where the continuations can be called on an arbitrary thread. In Rust, wakeups for an async task essentially just schedule it for running again, but the execution happens on the previous executor.
Anyway, in both models you can have deadlocks. And even if there is no deadlock, blocking the eventloop is still an antipattern, since it prevents other tasks which might be able to make progress from running.
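For a concrete picture, a minimal sketch of such a deadlock on a single-threaded Rust executor, using the futures crate's LocalPool (illustrative, not tied to any particular runtime):

```rust
use futures::channel::oneshot;
use futures::executor::{block_on, LocalPool};
use futures::task::LocalSpawnExt;

fn main() {
    let mut pool = LocalPool::new();
    let spawner = pool.spawner();
    let (tx, rx) = oneshot::channel::<i32>();

    spawner
        .spawn_local(async move {
            // block_on parks the only executor thread, so the task
            // below that would send the value can never run: deadlock.
            let value = block_on(rx).unwrap();
            println!("got {value}");
        })
        .unwrap();

    spawner
        .spawn_local(async move {
            tx.send(42).unwrap(); // never gets a chance to run
        })
        .unwrap();

    pool.run(); // hangs forever
}
```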
When I read "Colored function problem" and "make complexity explode" I thought it was some weird NP-complete scheduling issue having to do with graph coloring or something, but it turns out it's just a fancy term for not wanting to add async to everything that calls async. Basically just an ergonomics issue.
First, in Rust this isn't really a problem. You can always turn async calls into blocking ones in Rust by calling block_on [0]. In some languages block_on doesn't exist, like in-browser JS, because there, code is supposed to be async. But in Rust there is no such requirement, so there's no colored function problem here.
Second, I don't think it's a big problem in the first place. In one of my projects, I'm using an async library and have isolated the async-ness by creating a dedicated thread that communicates with the library. The thread provides a queue of messages that the remaining code of my project can handle.
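A rough sketch of that isolation pattern (query_library stands in for whatever the real async library exposes):

```rust
use std::sync::mpsc;
use std::thread;

use futures::executor::block_on;

// Hypothetical stand-in for an async third-party library call.
async fn query_library(req: String) -> String {
    format!("response to {req}")
}

fn main() {
    let (req_tx, req_rx) = mpsc::channel::<String>();
    let (resp_tx, resp_rx) = mpsc::channel::<String>();

    // All the async-ness lives on this one thread; the rest of the
    // program only talks to it over plain channels.
    thread::spawn(move || {
        for req in req_rx {
            let resp = block_on(query_library(req));
            resp_tx.send(resp).unwrap();
        }
    });

    req_tx.send("ping".to_string()).unwrap();
    println!("{}", resp_rx.recv().unwrap());
}
```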
The issue is turning blocking calls (or non-async calls) into non-blocking ones, or simply yielding from a callback deep in a callstack (the usual example is turning an internal iterator into an external one).
Of course you can add async to the whole callstack, but it could be third-party code, and it might require code duplication if async adds a penalty compared to non-async code.
Ideally the fixed stack size/conversion to state machine would be an optimization that the compiler would apply if it can prove that the coroutine only ever yields from the top level (or from a well-known and fixed stack depth), and resort to dynamic stacks otherwise. I have been thinking a lot about this, and I think the key is reifying the incoming continuation: as long as it doesn't escape the called coroutine, the optimization can be guaranteed. I believe Rust's lifetime machinery might help, but it is something I'm not familiar with.
It's not that hard, it's just pointless. Async is more general, so the optimization would have to go into the opposite direction: everything starts as implicitly async and things that provably don't need to be can just be converted to regular synchronous stackful code as an optimization pass.
That's what many functional programming languages do (or did, I think it went a bit out of style), but it is expensive and the interoperability story with C is not good.
edit: it also requires heap allocating activation frames in the most general case, which is slow.
After full CPS transformation, you absolutely can allocate the activation frames on the stack. Cf. CHICKEN Scheme that does precisely that, and more generally, uses the stack as the 0-generation. When the stack hits a certain depth, it longjmp's to the GC, copies whatever is alive to the heap and restarts the current continuation on the now-trimmed stack.
Yes, you can even use the original C stack as a bump allocator (that's Cheney on the MTA, right?), but then you need GC, which is not appropriate for Rust.
If any part of your function depends on the result of a promise then it has to return a promise; this is true whether you use async/await syntax or then(). Unfortunately JS doesn't have something like block_on, which now that I think about it is probably because it's single-threaded, so any block would block the entire app.
Making every function async is not fun, especially when you have to await the result of the arithmetic operators and write all those now-redundant asyncs and awaits--so let's also drop "async" and make "await" implicit, although we'll need something for non-awaiting. Let's call it "nowait". Now we can write code like this:
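(A sketch in hypothetical Rust-ish syntax--none of this compiles, and load_page, fetch_user, fetch_posts, and render are made-up names:)

```
// Hypothetical syntax: every call is implicitly awaited, and `nowait`
// opts out, handing back the still-pending future instead.
fn load_page(id: UserId) -> Page {
    let user = fetch_user(id);          // implicitly awaited
    let posts = nowait fetch_posts(id); // a future, not awaited yet
    render(user, posts)                 // implicitly awaited when used
}
```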
Agree :). Semantically there is no difference, but in practice async functions are compiled differently than normal functions. What you are proposing is in practice a global CPS transform, possibly reconstructing the original function when the continuation is not needed (i.e. it is always invoked in strict LIFO order). There are functional languages where this is the norm, but I don't think it is appropriate for a systems language like Rust (it would greatly complicate interop with C, for example).
My idea is to make the continuation explicit and CPS transform all and only the functions that have a continuation as parameter (and any generic function with a type that is a continuation or contains a continuation). Fall back to dynamic stacks if the continuation escapes.
No, it's not necessary to perform a CPS transformation. After all, the OS manages to pre-emptively schedule any non-cooperating processes, right?
So, leaving aside the problem of interacting with the OS for I/O, you can interleave CPU-bound work in a single thread by instrumenting your code with, basically, an instruction counter. When it grows a bit too large, stop, reset it, and switch to another task. Erlang does this, although being a functional language, it counts only function calls (there is no other way to make a loop than for a function to (tail-)call itself). No CPS needed.
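A minimal sketch of that reduction-counting idea in Rust (the names and the budget value are made up):

```rust
// Each task gets a fixed budget of "work units" per turn, in the
// spirit of Erlang's reduction counting: cooperative scheduling with
// no CPS transform and no OS preemption.
const BUDGET: u32 = 2_000;

trait Task {
    // Resumed repeatedly; returns true once finished.
    fn run(&mut self, budget: u32) -> bool;
}

struct Counter {
    n: u64,
    limit: u64,
}

impl Task for Counter {
    fn run(&mut self, mut budget: u32) -> bool {
        while self.n < self.limit {
            self.n += 1; // one "reduction"
            budget -= 1;
            if budget == 0 {
                return false; // budget exhausted: yield to the scheduler
            }
        }
        true
    }
}

fn main() {
    let mut tasks: Vec<Box<dyn Task>> = vec![
        Box::new(Counter { n: 0, limit: 10_000 }),
        Box::new(Counter { n: 0, limit: 5_000 }),
    ];
    // Round-robin over CPU-bound tasks on a single thread.
    while !tasks.is_empty() {
        tasks.retain_mut(|t| !t.run(BUDGET));
    }
}
```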
Maybe I am part of the problem by emphasizing context switching too much, but as penberg noticed, there is more to thread per core than context switching.
I'm quite tired of “What Color is Your Function”. In the five years since it was written I have yet to care about this supposed explosive problem in any language that uses async/await. It just seems like something that gets trotted out whenever anyone dares to suggest there are benefits to the approach or to talk about Rust. I'm not sure why I am supposed to accept it as meaningful truth.
That doesn't work because async functions need to be compiled for minimal stack usage (i.e. segmented stacks, packed variables), while non-async functions need to be compiled for minimal CPU usage (i.e. no stack checking or allocation, no variable packing, only one stack adjustment per function).
If you compile both for the same goal the system will have suboptimal memory usage or performance, and in general you don't know whether a function will be interrupted in advance if you don't have colored functions (also you can't move the stack if you have references unless you have a GC, which is terrible).
So thread-per-core is not just about eliminating context switches. It's also about partitioning application-level data to reduce inter-core synchronization to let speculative, out-of-order cores run independently as much as possible. Don't get me wrong, user-level threads are great, and arguably much simpler programming model than async/await. But they're not _the solution_ either.
User-level threads do not solve the problem of synchronization and data movement. That is, with a “thread-per-core” model, you eliminate most needs to synchronize between multiple CPU cores, and, therefore, eliminate the overhead of acquiring and releasing a lock and allow the out-of-order CPU cores run at full speed. Furthermore, “thread-per-core” model ensures that memory accesses are always CPU local, which eliminates the need to move data between CPU caches, which eliminates the expensive CPU cache consistency protocol (that makes scaling to multiple cores more difficult).
That said, I am not claiming thread-per-core is _the solution_ either, just saying that if you can partition data at application-level, you can make things run plenty fast. Of course, you’re also exposing yourself to other issues, like “hot shards” where some CPUs get disproportionately more work than others. However, as we scale to more and more cores, it seems inevitable that we must partition our systems at some level.
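A toy sketch of that kind of partitioning (all names illustrative): each worker owns one shard outright, so the hot path needs no locks and a shard's data stays in one core's cache.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::mpsc;
use std::thread;

fn main() {
    let shards = 4;
    let mut senders = Vec::new();
    let mut workers = Vec::new();

    for _ in 0..shards {
        let (tx, rx) = mpsc::channel::<(String, u64)>();
        senders.push(tx);
        workers.push(thread::spawn(move || {
            // Owned by exactly one thread: no Mutex, no cross-core
            // cache-line ping-pong for this shard's data.
            let mut map: HashMap<String, u64> = HashMap::new();
            for (key, value) in rx {
                *map.entry(key).or_insert(0) += value;
            }
        }));
    }

    // Route each key to a fixed shard so its data stays CPU-local.
    let route = |key: &str| {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        (h.finish() as usize) % shards
    };

    for key in ["a", "b", "c", "a"] {
        senders[route(key)].send((key.to_string(), 1)).unwrap();
    }

    drop(senders); // close the channels so the workers drain and exit
    for w in workers {
        w.join().unwrap();
    }
}
```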
It's pretty much common knowledge that if you want linear performance scaling on a multiprocessor then each workload has to be effectively independent from the others. Synchronization and communication is expensive and only takes away from your per core performance but it never makes a given core faster. So yes, sharding is a critical part of this "thread per core" architecture.
To be fair, concurrent writes do not scale, but single writer/multiple readers works just fine, so you only need a hard partition of writers and can allow all threads (or the appropriate NUMA subset) to access any data.
Binding user-level threads to hardware cores without preemption is completely equivalent to thread-per-core + coroutines, except that user-level threads allow deep stacks (with all the benefits and issues that implies).
In fact a well designed generic executor can be completely oblivious to whether it is scheduling plain closures, async functions or fibers. See for example boost.asio.
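A tiny sketch of that obliviousness (made-up types, not boost.asio itself): the executor only ever sees boxed closures, and an async task or a fiber resume can be wrapped into one.

```rust
use std::sync::mpsc;
use std::thread;

// The scheduler's entire interface: something callable, nothing more.
type Job = Box<dyn FnOnce() + Send>;

fn main() {
    let (tx, rx) = mpsc::channel::<Job>();
    let worker = thread::spawn(move || {
        for job in rx {
            job(); // plain closure or wrapped async task: all the same
        }
    });

    tx.send(Box::new(|| println!("plain closure"))).unwrap();
    tx.send(Box::new(|| {
        // An async computation hidden behind the same interface.
        futures::executor::block_on(async { println!("async task") });
    }))
    .unwrap();

    drop(tx); // close the queue so the worker exits
    worker.join().unwrap();
}
```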
I agree -- I was just around in the last days of Windows 3.1 and co-operative multitasking (which is close to async with a different coat), and it was a horrible experience, and fighting hangs was a never-ending task.
I would go further, and say assuming your program is doing "interesting work" (not just calling a database / reading a file and returning the results), use a threadpool. In almost all applications threads are fine, certainly when you are comparing threaded Rust to people using Python/Ruby.
> I sincerely believe this is Rust's ballpark; hell they even started the project with that idea in mind.
The problem with custom stacks is you essentially need to have a custom runtime and it makes calling into C functions a lot more difficult (cgo isn’t a cakewalk)
This runs counter to Rust's C++-inherited belief that you don't pay for what you don't use, and would have made Rust less feasible for all kinds of other projects.
Async/await might not have the best developer ergonomics, but it does have the best implementation ergonomics from a language point of view.
Is there a mechanism to group some goroutines onto a single CPU thread to avoid locking? In the environments that have to rely on thread-per-core (sensitive to even 50us), blocking functions will become obvious in the profiling report.
Nothing stops a model with lightweight threads from pinning a separate pool of fibers to a single thread. It would probably require a construct like a nursery configured to run its fibers on a dedicated thread.
It would be good if Rust had as many programming models as possible. Different applications are likely to benefit from different approaches, so having a wide array of choices is good.
No, the point of async "coloring" is to indicate which function call does context switching, so that you would know in which piece of code you could do synchronization without something like a semaphore, as you'd need a semaphore if you synchronize shared memory access across async calls, but you don't if you do it in between async calls. And also to know which piece of code blocks event loop (i.e. everything in between async calls), so that you could optimize for performance: throughput, latency. Assuming, of course, that you run one event loop per thread, not doing dumb things like some Rust executors scheduling tasks to different cores.
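To make that concrete, a small sketch on a single-threaded Rust executor (the futures crate's LocalPool, illustrative only): between two awaits a task cannot be interleaved with others, so plain unsynchronized state is fine there.

```rust
use std::cell::Cell;
use std::rc::Rc;

use futures::executor::LocalPool;
use futures::task::LocalSpawnExt;

fn main() {
    let mut pool = LocalPool::new();
    let spawner = pool.spawner();
    let counter = Rc::new(Cell::new(0u32)); // no Mutex, not even Sync

    for _ in 0..2 {
        let counter = counter.clone();
        spawner
            .spawn_local(async move {
                // Read-modify-write with no await in between: atomic
                // with respect to every other task on this event loop.
                counter.set(counter.get() + 1);
                // Across an await the guarantee is gone; another task
                // may run and touch the counter here.
                futures::future::ready(()).await;
                counter.set(counter.get() + 1);
            })
            .unwrap();
    }

    pool.run();
    assert_eq!(counter.get(), 4);
}
```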
Goroutine (loom) style concurrency only makes things worse both ergonomically and performance-wise, pushing programmers towards slow buggy lock-riddled code as the only way to use such models.
You turn a real engineering issue into a programming language dogma war. Really?
Don't use async/await then. It's just syntactic sugar (minus a few details that make it so ergonomic that you can do things that otherwise would be more work than they are worth). Just use plain futures. Be joyful, as there will be no "colored" functions in your editor--only 10x more code (and less readable!).
I think there is a place for arguing about async await syntax and whether fibers are the better solution (try implementing them without a large runtime! Rust got rid of theirs.). Clearly, the benefits provided by continuations are worthwhile in avoiding context switches - to argue otherwise is wild. Context switches are very expensive and, in performance critical code, a huge red flag.
[1] https://journal.stuffwithstuff.com/2015/02/01/what-color-is-...
[2] https://jdk.java.net/loom/