mycoliza's comments | Hacker News

In reply to your edit, that section in the RFD includes a link to the full example in the Rust playground. You’ll note that it does not make any use of `select!`: https://play.rust-lang.org/?version=stable&mode=debug&editio...

Perhaps the full example should have been reproduced in the RFD for clarity…


An analogous problem is equally possible with streams: https://rfd.shared.oxide.computer/rfd/0609#_how_you_can_hit_...


> No, just have select!() on a bunch of owned Futures return the futures that weren't selected instead of dropping them. Then you don't lose state.

How does that prevent this kind of deadlock? If the owned future has acquired a mutex, and you return that future from the select so that it might be polled again, and the user assigns it to a variable, then the future that has acquired the mutex but has not completed is still not dropped. This is basically the same as polling an `&mut future`, but with more steps.
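
To make that concrete, here's a minimal sketch of the shape of the deadlock (hypothetical names, loosely following the RFD's example). Whether the unfinished future is kept alive via `&mut` or handed back by a hypothetical returning-select, the guard it holds stays held either way:

    use std::sync::Arc;
    use std::time::Duration;
    use tokio::sync::Mutex;

    // Hypothetical stand-in for the RFD's example: acquires the lock,
    // then hits an .await point while still holding the guard.
    async fn do_async_thing(lock: Arc<Mutex<()>>) {
        let _guard = lock.lock().await;
        tokio::time::sleep(Duration::from_secs(1)).await;
    }

    async fn demo(lock: Arc<Mutex<()>>) {
        let mut fut = Box::pin(do_async_thing(lock.clone()));
        let mut interval = tokio::time::interval(Duration::from_millis(10));
        loop {
            tokio::select! {
                _ = &mut fut => break,
                _ = interval.tick() => {
                    // While this body awaits the lock, `fut` is alive and
                    // still owns the guard, but nothing is polling it, so
                    // it can never reach the point where it releases the
                    // lock: deadlock.
                    let _guard = lock.lock().await;
                }
            }
        }
    }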


> How does that prevent this kind of deadlock?

Like I said, it doesn't:

> even this doesn't fix the root cause because dropping an owned Future isn't guaranteed to cancel it cleanly.

It fixes this:

> However you also wouldn’t be able to use select! in a while loop and try to acquire the same lock (or read from the same channel) without losing your position in the queue.

If you want to fix the root cause, see https://news.ycombinator.com/item?id=45777234


Indeed, you are correct (and hi Matthias!). After we got to the bottom of this deadlock, my coworkers and I had one of our characteristic "how could we have prevented this?" conversations, and reached the somewhat sad conclusion that actually, there was basically nothing we could easily blame for this. All the Tokio primitives involved were working precisely as they were supposed to. The only thing that would have prevented this without completely re-designing Rust's async from the ground up would be to ban the use of `&mut future`s in `select!`...but that eliminates a lot of correct code, too. Not being able to do that would make it pretty hard to express a lot of things that many applications might reasonably want to express, as you described. I discussed this a bit in this comment[1] as well.

On the other hand, the coworker who had written the code where we found the bug wasn't to blame, either. It wasn't a case of sloppy programming; he had done everything correctly and put the pieces together the way you were supposed to. All the pieces worked as they were supposed to, and his code seemed to be using them correctly, but the interaction of these pieces resulted in a deadlock that would have been very difficult for him to anticipate.

So, our conclusion was, wow, this just kind of sucks. Not an indictment of async Rust as a whole, but an unfortunate emergent behavior arising from an interaction of individually well-designed pieces. Just something you gotta watch out for, I guess. And that's pretty sad to have to admit.

[1] https://news.ycombinator.com/item?id=45776868


> All the Tokio primitives involved were working precisely as they were supposed to. The only thing that would have prevented this without completely re-designing Rust's async from the ground up would be to ban the use of `&mut future`s in `select!`...but that eliminates a lot of correct code, too.

But it still suggests that `tokio::select` is too powerful. You don't need to get rid of `tokio::select`, you just need to consider creating a less powerful mechanism that doesn't risk exhibiting this problem. Then you could use that less powerful mechanism in the places where you don't need the full power of `tokio::select`, thereby reducing the possible places where this bug could arise. You don't need to get rid of the fully powerful mechanism, you just need to make it optional.


I feel like select!() is a good case study because the common future timeout use-case maps pretty closely to a select!(), so there is only so much room to weaken it.
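
For instance, a timeout is more or less a two-armed select. A sketch (the real tokio::time::timeout returns a Result rather than an Option):

    use std::future::Future;
    use std::time::Duration;

    // Roughly how a timeout maps onto a two-armed select.
    async fn timeout_ish<F: Future>(dur: Duration, fut: F) -> Option<F::Output> {
        tokio::select! {
            out = fut => Some(out),
            _ = tokio::time::sleep(dur) => None,
        }
    }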

The ways I can think of for making select!() safer all involve runtime checks and allocations (possibly this is just a failure of my imagination!). But if that's the case, I would find it bothersome if our basic async building blocks like select/timeout in practice turn out to require more expensive runtime checks or allocations to be safe.

We have a point in the async design space where we pay a complexity price, but in exchange we get really neat zero-cost futures. But I feel like we only get our money's worth if we can actually statically prove that correct use won't deadlock, without the expensive runtime checks! Otherwise, can we afford to spend all this complexity budget?

The implementation of select!() does feel way too powerful in a way (it's a whole mini scheduler that creates implicit future dependencies hidden from the rest of the executor, and then sometimes this deadlocks!). But the need is pretty foundational, it shows up everywhere as a building block.


It feels to me like there's plenty of design space to explore. Sure, it's possible to view "selection" as a basic building block, but even that is insufficiently precise IMO. There's a reason that JavaScript provides all of Promise.any and Promise.all and Promise.allSettled and Promise.race. Selection isn't just a single building block, it's an entire family of building blocks with distinct semantics.
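
(The Rust `futures` crate makes a similar split, for what it's worth. A rough sketch of the loose correspondences, not exact semantics:)

    use futures::future::{self, ready};

    async fn demo() {
        // Promise.all-like: wait for every result.
        let _all: Vec<u32> = future::join_all(vec![ready(1), ready(2)]).await;

        // Promise.race-like: the first completion wins, and the unfinished
        // futures are handed back rather than silently dropped.
        let (_first, _index, _remaining) =
            future::select_all(vec![ready(1), ready(2)]).await;
    }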


You must guarantee forward progress inside your critical sections and that means your critical sections are guaranteed to finish. How hard is that to understand? From my perspective this situation was basically guaranteed to happen.

There is no real difference between a deadlock caused by a single thread acquiring the same non-reentrant lock twice and a single thread with two virtual threads, where the first thread calls the code of the second thread inside the critical section. They are the same type of deadlock, caused by the same fundamental problem.
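
The synchronous version fits in a few lines (sketch):

    use std::sync::Mutex;

    fn demo() {
        let m = Mutex::new(());
        let _g1 = m.lock().unwrap();
        // Same thread, same non-reentrant lock: std documents this as
        // either a deadlock or a panic, and it certainly never proceeds.
        let _g2 = m.lock().unwrap();
    }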

>Remember too that the Mutex could be buried beneath several layers of function calls in different modules or packages. It could require looking across many layers of the stack at once to be able to see the problem.

That is a fundamental property of mutexes. Whenever you have a critical section, you must be 100% aware of every single line of code inside that critical section.

>There’s no one abstraction, construct, or programming pattern we can point to here and say "never do this". Still, we can provide some guidelines.

The programming pattern you're looking for is guaranteeing forward progress inside critical sections. Only synchronous code is allowed to be executed inside a critical section. The critical section must be as small as possible. It must never be interrupted, ever.

Sounds like a pain in the ass, right? That's right, locks are a pain in the ass.
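
In async Rust, that discipline looks roughly like this (a sketch with made-up names):

    use std::sync::Arc;
    use tokio::sync::Mutex;

    async fn other_async_work() {
        tokio::task::yield_now().await;
    }

    async fn update(shared: Arc<Mutex<Vec<u8>>>, byte: u8) {
        // The critical section: synchronous only, as small as possible.
        {
            let mut data = shared.lock().await;
            data.push(byte);
        } // guard dropped here, before the next .await point
        other_async_work().await;
    }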


Yeah, a coworker coming from Go asked a similar question about why Rust doesn't have something like the Go runtime's deadlock detector. Your comment is quite similar to the explanation I gave him.

Go, unlike Rust, does not really have a notion of intra-task concurrency; goroutines are the fundamental unit of concurrency and parallelism. So, the Go runtime can reason about dependencies between goroutines quite easily, since goroutines are the things which it is responsible for scheduling. The fact that channels are a language construct, rather than a library construct implemented in the language, is necessary for this too. In (async) Rust, on the other hand, tasks are the fundamental unit of parallelism, but not of concurrency; concurrency emerges from the composition of `Future`s, and a single task is a state machine which may execute any number of futures concurrently (but not in parallel), by polling them until they cannot proceed without waiting and then moving on to poll another future until it cannot proceed without waiting. But critically, this is not what the task scheduler sees; it interacts with these tasks as a single top-level `Future`, and is not able to look inside at the nested futures they are composed of.
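
To make the distinction concrete, a sketch: both lines below are concurrent, but only the spawned pair can run in parallel, and only the spawned pair is visible to the scheduler as separate units.

    async fn do_a() -> u32 { 1 }
    async fn do_b() -> u32 { 2 }

    async fn demo() {
        // Intra-task concurrency: both futures live inside this one task's
        // state machine. They interleave at .await points, never in
        // parallel, and the scheduler sees only the single enclosing task.
        let (_a, _b) = tokio::join!(do_a(), do_b());

        // Task-level parallelism: each future becomes its own task, which
        // the scheduler can observe and run on different worker threads.
        let (_ra, _rb) = tokio::join!(tokio::spawn(do_a()), tokio::spawn(do_b()));
    }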

This specific failure mode can only happen when multiple futures are polled concurrently but not in parallel, within a single Tokio task. So there is actually no way for the Tokio scheduler to have insight into this problem. You could imagine a deadlock detector in the Tokio runtime that operates at the task level, but it could never detect this problem, because when these operations execute in parallel, it cannot occur. In fact, one of the suggestions for how to avoid this issue is to select over spawned tasks rather than futures within the same task.
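
To illustrate that last suggestion, a sketch (mirroring the hypothetical example upthread):

    use std::sync::Arc;
    use std::time::Duration;
    use tokio::sync::Mutex;

    async fn do_async_thing(lock: Arc<Mutex<()>>) {
        let _guard = lock.lock().await;
        tokio::time::sleep(Duration::from_secs(1)).await;
    }

    async fn demo(lock: Arc<Mutex<()>>) {
        // As a spawned task, op1 is polled by the runtime itself, not by
        // this task's select!.
        let mut op1 = tokio::spawn(do_async_thing(lock.clone()));
        tokio::select! {
            _ = &mut op1 => {}
            _ = tokio::time::sleep(Duration::from_millis(10)) => {
                // No deadlock this time: op1 keeps running in parallel
                // while we block here, finishes, and releases the lock.
                let _guard = lock.lock().await;
            }
        }
    }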


Thank you. Every time I've tried to approach the concept of Rust's parallelism this is what rubs me the wrong way.

I haven't yet read a way to prove it's correct, or even to reasonably prove a given program's use is not going to block.

With more traditional threads my mental model is that _everything_ always has to be interruptible, have some form of engineer-chosen timeout for a parallel operation, and address failure of operation in design.

I never see any of that in the toy examples that are presented as educational material. Maybe Rust's async also requires such careful design to be safely utilized.


Guess Rust is built more for memory safety than concurrency? Erlang maybe? Why can't we just have a language that is memory safe and built for concurrency? Like OCaml and Erlang combined?


Are you looking for Gleam? Simple but powerful typed functional language for BEAM and JavaScript. It’s a bit high-level compared to OCaml in terms of needing a thick runtime and being somewhat far from machine code.

Really beautiful language design imo. Does a great job avoiding the typelevel brainfuck problem I have with Haskell.

https://gleam.run/


Rust is absolutely built for concurrency, even more so than for memory safety--it just so happens that memory safety is a prerequisite for thread safety. You're going to have a hard time finding any other industrial-strength language that statically prevents data races. If you can use Erlang, then sure, use Erlang. But if you can't use Erlang, and you need concurrency, you're not going to find a better candidate than Rust.


I think an important takeaway here that many often ignore is that in language design, not having low-level control over something is sometimes just as important a design tradeoff as having it.

From that it also follows that it may not be too fruitful to try to tackle every domain there is with a single language only.

(With that said, I absolutely love sync Rust, and Go is definitely not a good example of an elegantly designed language; I am talking in a more general way here.)


As a member of (Eliza, Sean, John, and Dave), I can second that debugging this was certainly an adventure. I'm not going to go as far as to say that we had fun, since...you can't have a heroic narrative without real struggle. But it was certainly rewarding to be in the room for that "a-ha!" moment, in which all the pieces really did begin to fit together very quickly. It was like the climax of a detective story --- and it was particularly well-scripted the way each of us contributed a little piece of the puzzle.


Since you are one of the people working directly on this codebase, may I ask why select! is being used/allowed in the first place?

Its footgun-y nature has been known for years (IIRC even the first version of the tokio documentation warned against that) and as such I don't really understand why people are still using it. (For context I was the lead of a Rust team working on a pretty complex async networking program and we had banned select! very early in the project and never regretted this decision once).


What to use instead?


Your favorite LLM will give you several good options. You can use an mpsc channel with one sender task per branch: each task waits on what would have been its select branch and then sends a message, and you just wait on the receiving end of the channel. Or you could use https://docs.rs/futures/latest/futures/future/fn.select.html (or the select_all version). Both make it more obvious what is going on. The last way would be to implement a future manually. This would probably be my favorite option in many cases, as it has the least overhead, but if you've never implemented a future it may be a bit intimidating.

Edit: tokio docs actually provide several suggestions: https://docs.rs/tokio/latest/tokio/macro.select.html#alterna...
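
For what it's worth, the futures::future::select option looks roughly like this (sketch):

    use futures::future::{self, Either};

    async fn demo() {
        let a = Box::pin(async { 1u32 });
        let b = Box::pin(async { 2u32 });
        match future::select(a, b).await {
            // The losing future is handed back rather than dropped, so any
            // state it holds (a lock, a queue position) stays explicitly in
            // your hands and you can keep driving it.
            Either::Left((_a_out, b_rest)) => { let _ = b_rest.await; }
            Either::Right((_b_out, a_rest)) => { let _ = a_rest.await; }
        }
    }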


We actually don't use Rust async in the embedded parts of our system. This is largely because our firmware is based on a multi-tasking microkernel operating system, Hubris[1], and we can express concurrency at the level of the OS scheduler. Although our service processors are single-core systems, we can still rely on the OS to schedule multiple threads of execution.

Rust async is, however, very useful in single-core embedded systems that don't have an operating system with preemptive multitasking, where one thread of execution is all you ever get. It's nice to have a way to express that you might be doing multiple things concurrently in an event-driven way without having to have an OS to manage preemptive multitasking.

[1] https://hubris.oxide.computer/


Heh, this is super interesting to hear. Single-threaded async/concurrent code is so fun and interesting to see. I’ve run some tokio programs in single-threaded mode just to see it in action


What could `tokio::select!` do differently here to prevent bugs like this?

In the case of `select!`, it is a direct consequence of the ability to poll a `&mut` reference to a future in a `select!` arm, where the future is not dropped should another future win the "race" of the select. This is not really a choice Tokio made when designing `select!`, but is instead due to the existence of implementations of `Future` for `&mut T: Future + Unpin`[1] and `Pin<T: Future>`[2] in the standard library.

Tokio's `select!` macro cannot easily stop the user from doing this, and, furthermore, the fact that you can do this is useful --- there are many legitimate reasons you might want to continue polling a future if another branch of the select completes first. It's desirable to be able to express the idea that we want to continually drive one asynchronous operation to completion while periodically checking if some other thing has happened and taking action based on that, and then continue driving forward the ongoing operation. That was precisely what the code in which we found the bug was doing, and it is a pretty reasonable thing to want to do; a version of the `select!` macro which disallows that would limit its usefulness. The issue arises specifically from the fact that the `&mut future` has been polled to a state in which it has acquired, but not released, a shared lock or lock-like resource, and then another arm of the `select!` completes first and the body of that branch runs async code that also awaits that shared resource.
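
Concretely, those two impls are what make a pattern like this type-check (sketch):

    use std::time::Duration;
    use tokio::time::sleep;

    async fn demo() {
        let fut = sleep(Duration::from_secs(1));
        tokio::pin!(fut); // `fut` is now a Pin<&mut Sleep> on the stack
        tokio::select! {
            // Thanks to the `&mut F: Future` impl, this arm polls `fut`
            // without consuming it...
            _ = &mut fut => {}
            _ = sleep(Duration::from_millis(10)) => {
                // ...so if this arm wins, `fut` is still alive, retains
                // whatever state it reached, and can be awaited later.
                fut.await;
            }
        }
    }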

If you can think of an API change which Tokio could make that would solve this problem, I'd love to hear it. But, having spent some time trying to think of one myself, I'm not sure how it would be done without limiting the ability to express code that one might reasonably want to be able to write, and without making fundamental changes to the design of Rust async as a whole.

[1]: https://doc.rust-lang.org/stable/std/future/trait.Future.htm...
[2]: https://doc.rust-lang.org/stable/std/future/trait.Future.htm...


> It's desirable to be able to express the idea that we want to continually drive one asynchronous operation to completion while periodically checking if some other thing has happened and taking action based on that, and then continue driving forward the ongoing operation.

This idea may be desirable, but a deadlock is possible if there's a dependency between the two operations. The crux is the "and then continue," which I'm taking to mean that the first operation is meant to pause whilst the second operation occurs. The use of `&mut` in the code specifically enables that too.

If it's OK for the first operation to run concurrently with the other thing, then wrt. Tokio's APIs, have you seen LocalSet[1]? Specifically:

    let local = LocalSet::new();
    // Run op2 as its own (same-thread) task instead of a select! arm:
    local.spawn_local(async move {
        sleep(Duration::from_millis(500)).await;
        do_async_thing("op2", lock.clone()).await;
    });
    // Drive future1 while the LocalSet keeps polling op2 concurrently:
    local.run_until(&mut future1).await;
This code expresses your idea under a concurrent environment that resolves the deadlock. However, `op2` will still never acquire the lock because `op1` is first in the queue. I strongly suspect that isn't the intended behaviour; but, it's also what would have happened if the `select!` code had worked as imagined.

[1] https://docs.rs/tokio/latest/tokio/task/struct.LocalSet.html


A meta-idea I have: look at all usages of `select!` with `&mut future`s in the code, and see if there are maybe 4 or 5 patterns that emerge. With that it might be possible to say "instead of `select!` use `poll_continuing!` or `poll_first_up!` or `poll_some_other_common_pattern!`".

It feels like a lot of the way Rust untangles these tricky problems is by identifying slightly more contextful abstractions, though at the cost of needing more scratch space in the mind for various methods


I can imagine an alternate universe in which you cannot do:

1. Create future A.

2. Poll future A at least once but not provably poll it to completion and also not drop it. This includes selecting it.

3. Pause yourself by awaiting anything that does not involve continuing to poll A.

I’m struggling a bit to imagine the scenario in which it makes sense to pause a coroutine that you depend on in the middle like this. But I also don’t immediately see a way to change a language like Rust to reliably prevent doing this without massively breaking changes. See my other comment :)


I'm not familiar with tokio, but I am familiar with folly coro in C++, which is similar-ish. You cannot co_await a folly::coro::Task by reference; you must move it. It seems like that prevents this bug. So maybe select! is the low-level API and a higher-level (i.e. safer) abstraction can be built on top?


As an Oxide employee, let's just say...it does, and it is :)


We also sell computers... :)

