Intentionally or not, this post demonstrates one of the things that makes safer ...

kouteiheika · 2025-11-17T13:42:24 1763386944

> the shared pointer implementation uses a POSIX mutex [...] C++’s shared pointer has the same problem

It doesn't. C++'s shared pointers use atomics, just like Rust's Arc does. There's no good reason (unless you have some very exotic requirements, into which I won't get into here) to implement shared pointers with mutexes. The implementation in the blog post here is just suboptimal.

(But it's true that C++ doesn't have Rust's equivalent of Rc, which means that if you just need a reference counted pointer then using std::shared_ptr is not a zero cost abstraction.)

woodruffw · 2025-11-17T13:45:55 1763387155

To be clear, the “same problem” is that it’s not a zero-cost abstraction, not that it uses the same specific suboptimal approach as this blog post.

kouteiheika · 2025-11-17T13:51:01 1763387461

I think that's an orthogonal issue. It's not that C++'s shared pointer is not a zero cost abstraction (it's as much a zero cost abstraction as in Rust), but that it only provides one type of a shared pointer.

But I suppose we're wasting time on useless nitpicking. So, fair enough.

woodruffw · 2025-11-17T13:59:04 1763387944

I think they’re one and the same: C++ doesn’t have program-level thread safety by construction, so primitives like shared pointers need to be defensive by default instead of letting the user pick the right properties for their use case.

Edit: in other words C++ could provide an equivalent of Rc, but we’d see no end of people complaining when they shoot themselves in the foot with it.

(This is what “zero cost abstraction” means: it doesn’t mean no cost, just that the abstraction’s cost is no greater than the semantically equivalent version written by the user. So both Arc and shared_ptr are zero-cost in a MT setting, but only Rust has a zero-cost abstraction in a single-threaded setting.)

kouteiheika · 2025-11-17T14:16:02 1763388962

I can't say I agree with this? If C++ had an Rc equivalent (or if you'd write one yourself) it would be just as zero cost as it is in Rust, both in a single-threaded setting and in a multithreaded-setting. "Zero cost abstraction" doesn't mean that it cannot be misused or that it doesn't have any cognitive overhead to use correctly, just that it matches whatever you'd write without the abstraction in place. Plenty of "zero cost" features in C++ still need to you pay attention to not accidentally blow you leg off.

Simply put, just as a `unique_ptr` (`Box`) is an entirely different abstraction than `shared_ptr` (`Arc`), an `Rc` is also an entirely different abstraction than `Arc`, and C++ simply happens to completely lack `Rc` (at least in the standard; Boost of course has one). But if it had one you could use it with exactly the same cost as in Rust, you'd just have to manually make sure to not use it across threads (which indeed is easier said than done, which is why it's not in the standard), exactly the same as if you'd manually maintain the reference count without the nice(er) abstraction. Hence "zero cost abstraction".

woodruffw · 2025-11-17T14:28:30 1763389710

Sorry, I realized I’m mixing two things in a confusing way: you’re right that C++ could easily have a standard zero-cost Rc equivalent; I’m saying that it can’t have a safe one. I think this is relevant given the weight OP gives to both performance and safety.

SR2Z · 2025-11-17T19:41:01 1763408461

Isn't the point of using atomics that there is virtually no performance penalty in single threaded contexts?

IMO "zero cost abstraction" just means "I have a slightly less vague idea of what this will compile to."

SkiFire13 · 2025-11-17T19:46:08 1763408768

No, atomics do have a performance penality compared to the equivalent single threaded code due to having to fetch/flush the impacted cache lines in the eventuality that another thread is trying to atomically read/write the same memory location at the same time.

CyberDildonics · 2025-11-17T22:07:21 1763417241

Atomics have almost no impact when reading, which is what would happen in a shared pointer the vast majority of the time.

woodruffw · 2025-11-17T22:20:40 1763418040

> which is what would happen in a shared pointer the vast majority of the time.

This seems workload dependent; I would expect a lot of workloads to be write-heavy or at least mixed, since copies imply writes to the shared_ptr's control block.

oconnor663 · 2025-11-17T23:17:09 1763421429

I think it's pretty rare to do a straight up atomic load of a refcount. (That would be the `use_count` method in C++ or the `strong_count` method in Rust.) More of the time you're doing either a fetch-add to copy the pointer or a fetch-sub to destroy your copy, both of which involve stores. Last I heard the fetch-add can use the "relaxed" atomic ordering, which should make it very cheap, but the fetch-sub needs to use the "release" ordering, which is where the cost comes in.

groundzeros2015 · 2025-11-18T02:03:13 1763431393

> C++ doesn’t have program-level thread safety by construction

It does. It’s called a process.

Everyone chose convenience and micro-benchmarks by choosing threads instead.

woodruffw · 2025-11-18T03:13:56 1763435636

"Thread truther" is not one of the arguments I had on the bingo card for this conversation.

groundzeros2015 · 2025-11-18T03:59:51 1763438391

I guessed as much. I’m not alone - there is a whole chapter on this topic in “The art of UNIX programming”.

cogman10 · 2025-11-17T13:47:12 1763387232

> very exotic requirements

I'd be interested to know what you are thinking.

The primary exotic thing I can imagine is an architecture lacking the ability to do atomic operations. But even in that case, C11 has atomic operations [1] built in. So worst case, the C library for the target architecture would likely boil down to mutex operations.

[1] https://en.cppreference.com/w/c/atomic.html

kouteiheika · 2025-11-17T14:39:14 1763390354

Well, basically, yeah, if your platform lacks support for atomics, or if you'd need some extra functionality around the shared pointer like e.g. logging the shared pointer refcounts while enforcing consistent ordering of logs (which can be useful if you're unfortunate enough to have to debug a race condition where you need to pay attention to refcounts, assuming the extra mutex won't make your heisenbug disappear), or synchronizing something else along with the refcount (basically a "fat", custom shared pointer that does more than just shared-pointering).

colonwqbang · 2025-11-17T15:59:37 1763395177

Does there exist any platform which has multithreading but not atomics? Such a platform would be quite impractical as you can't really implement locks or any other threading primitive without atomics.

addaon · 2025-11-17T17:36:24 1763400984

> Does there exist any platform which has multithreading but not atomics?

Yes. Also, almost every platform I know that supports multi threading and atomics doesn’t support atomics between /all/ possible masters. Consider a microcontroller with, say, two Arm cores (multithreaded, atomic-supporting) and a DMA engine.

lpribis · 2025-11-17T22:31:08 1763418668

Yes but "atomic" operations with the DMA engine are accomplished through interrupts (atomic) or memory mapped IO configuration (atomic).

cogman10 · 2025-11-17T16:18:55 1763396335

Certainly such systems can pretty readily exist. You merely need atomic reads/writes in order to implement locks.

You can't create userspace locks which is a bummer, but the OS has the capability of enforcing locks. That's basically how early locking worked.

The main thing needed to make a correct lock is interrupt protection. Something every OS has.

To go fast, you need atomic operations. It especially becomes important if you are dealing with multiple cores. However, for a single core system atomics aren't needed for the OS to create locks.

SkiFire13 · 2025-11-17T19:51:02 1763409062

> You merely need atomic reads/writes in order to implement locks.

Nit: while it's possible to implement one with just atomic reads and writes, it's generally not trivial/efficient/ergonomic to do so without an atomic composite read-write operation, like a compare-and-swap.

colonwqbang · 2025-11-17T17:13:00 1763399580

I wrote "multithreaded" but I really meant "multicore". If two cores are contending for a lock I don't see how irq protection help. As long as there is only one core, I agree.

cogman10 · 2025-11-17T18:05:01 1763402701

On most multicore systems you can pin the IRQ handling to a single core. Pinning locking interrupts to a single core would be how you handle this.

colonwqbang · 2025-11-19T12:29:19 1763555359

True, but locks are not only needed inside IRQ handler routines.

oconnor663 · 2025-11-17T23:22:51 1763421771

The boring answer is that standard atomics didn't exist until C++11, so any compiler older than that didn't support them. I think most platforms (certainly the popular desktop/server platforms) had ways to accomplish the same thing, but that was up to the vendor, and it might not've been well documented or stable. Infamously, `volatile` used to be (ab)used for this a lot before we had proper standards. (I think it still has some atomic-ish properties in MSVC?)

goalieca · 2025-11-17T15:08:02 1763392082

Which platforms might that be? Even MIPS has atomics (at least pointer sized last i checked).

cogman10 · 2025-11-17T16:03:12 1763395392

AFIAK, and I'm not MIPS expert, but I believe it doesn't have the ability to add a value directly to a memory address. You have to do something like

    // Not real MIPS, just what I've gleaned from a brief look at some docs
    LOAD addr, register
    ADD 1, register
    STORE register, addr

The LOAD and STORE are atomic, but the `ADD` happens out of band.

That's a problem if any sort of interrupt happens (if you are multi-threading then a possibility). If it happens at the load, then a separate thread can update "addr" which mean the later STORE will stomp on what's there.

x86 and ARM can do

    ADD 1, addr

as well as other instructions like "compare and swap"

    LOAD addr, register
    MOV register, register2
    ADD 1, register2
    COMPARE_AND_SWAP addr, register, register2
    if (cas_failed) { try again }

unnah · 2025-11-17T17:02:51 1763398971

On MIPS you can simulate atomics with a load-linked/store-conditional (LL/SC) loop. If another processor has changed the same address between the LL and SC instructions, the SC fails to store the result and you have to retry. The underlying idea is that the processors would have to communicate memory accesses to each other via the cache coherence protocol anyway, so they can easily detect conflicting writes between the LL and SC instructions. It gets more complicated with out-of-order execution...

    loop: LL r2, (r1)
          ADD r3, r2, 1
          SC r3, (r1)
          BEQ r3, 0, loop
          NOP

accelbred · 2025-11-17T16:49:29 1763398169

Unfortunately, for C++, thats not true. At least with glibc and libstdc++, if you do not link with pthreads, then shared pointers are not thread-safe. At runtime it will do a symbol lookup for a pthreads symbol, and based off the result, the shared pointer code will either take the atomic or non-atomic path.

I'd much rather it didnt try to be zero-cost and it always used atomics...

TuxSH · 2025-11-17T18:42:45 1763404965

True, but that's a fault of the implementation, which assumes POSIX is the only thing in town & makes questionable optimization choices, rather that of the language itself

(for reference, the person above is referring to what's described here: https://snf.github.io/2019/02/13/shared-ptr-optimization/)

wyldfire · 2025-11-17T19:46:12 1763408772

> the language itself

The "language" is conventionally thought of as the sum of the effects given by the { compiler + runtime libraries }. The "language" often specifies features that are implemented exclusively in target libraries, for example. You're correct to say that they're not "language features" but the two domains share a single label like "C++20" / "C11" - so unless you're designing the toolchain it's not as significant a difference.

We're down to ~three compilers: gcc, clang, MSVC and three corresponding C++ libraries.

TuxSH · 2025-11-19T11:02:36 1763550156

I agree with what you said, however neither libc++ nor MS-STL have this "optimization" to my knowledge

woodruffw · 2025-11-17T17:08:30 1763399310

This is, impressively, significantly worse than I realized!

eddd-ddde · 2025-11-17T19:11:03 1763406663

Why use atomics if you don't need them? There really should just be two different shared pointer types.

accelbred · 2025-11-17T22:56:41 1763420201

I wouldn't mind two types. I mind shared pointer not using atomics if I statically link pthreads and dlload a shared lib with them, or if Im doing clone3 stuff. Ive had multiple situations in which the detection method would turn off atomic use when it actually needs to be atomic.

spacedcowboy · 2025-11-17T13:46:12 1763387172

The number of times I might want to write something in C and have it less likely to crash absolutely dwarfs the number of times I care about that code being cross-platform.

Sure, cross-platform is desirable, if there's no cost involved, and mandatory if you actually need it, but it's a "nice to have" most of the time, not a "needs this".

As for mutex overheads, yep, that's annoying, but really, how annoying ? Modern CPUs are fast. Very very fast. Personally I'm far more likely to use an os_unfair_lock_t than a pthread_mutex_t (see the previous point) which minimizes the locking to a memory barrier, but even if locking were slow, I think I'd prefer safe.

Rust is, I'm sure, great. It's not something I'm personally interested in getting involved with, but it's not necessary for C (or even this extra header) to do everything that Rust can do, for it to be an improvement on what is available.

There's simply too much out there written in C to say "just use Rust, or Swift, or ..." - too many libraries, too many resources, too many tutorials, etc. You pays your money and takes your choice.

woodruffw · 2025-11-17T14:02:29 1763388149

That’s all reasonable, but here’s one of the primary motivations from the post:

> We love its raw speed, its direct connection to the metal

If this is a strong motivating factor (versus, say, refactoring risk), then C’s lack of safe zero-cost abstractions is a valid concern.

lelanthran · 2025-11-17T16:18:29 1763396309

> As for mutex overheads, yep, that's annoying, but really, how annoying ?

For this use-case, you might not notice. ISTR, when examing the pthreads source code for some platform, that mutexes only do a context switch as a fallback, if the lock cannot be acquired.

So, for most use-cases of this header, you should not see any performance impact. You'll see some bloat, to be sure.

lmm · 2025-11-18T01:33:52 1763429632

> There's simply too much out there written in C to say "just use Rust, or Swift, or ..." - too many libraries, too many resources, too many tutorials, etc.

There really isn't. Speaking as someone who works in JVM-land, you really can avoid C all the time if you're willing to actually try.

spacedcowboy · 2025-11-19T09:12:47 1763543567

shrug horses for courses. I’m at that wonderful stage of life where I only code what I want to, I don’t have people telling me what to do. I’m not going to throw away decades of code investment for some principle that I don’t really care about - if I did care more, I’d probably be more invested in rust after all.

Plus, a lot of what I do is on microcontrollers with tens of kilobytes of RAM, not big-iron massively parallel servers where Java is commonly used. The vendor platform libraries are universally provided in C, so unless you want to reimplement the SPI or USB handler code, and probably write the darn rust implementation/Java virtual machine, and somehow squeeze it all in, then no, you can’t really avoid C.

Or assembler for that matter, interrupt routines often need assembly language to get latency down, and memory management (use this RAM address range because it’s “TCM” 1-clock latency, otherwise it’s 5 or 6 clocks and everything breaks…)

lelanthran · 2025-11-17T16:14:30 1763396070

> Intentionally or not, this post demonstrates one of the things that makes safer abstractions in C less desirable: the shared pointer implementation uses a POSIX mutex, which means it’s (1) not cross platform, and (2) pays the mutex overhead even in provably single-threaded contexts. In other words, it’s not a zero-cost abstraction.

It's an implementation detail. They could have used atomic load/store (since c11) to implement the increment/decrement.

TBH I'm not sure what a mutex buys you in this situation (reference counting)

saurik · 2025-11-17T14:08:06 1763388486

I'd think a POSIX mutex--a standard API that I not only could implement anywhere, but which has already been implemented all over the place--is way more "cross platform" than use of atomics.

woodruffw · 2025-11-17T14:32:57 1763389977

To lift things up a level: I think a language’s abstractions have failed if we even need to have a conversation around what “cross platform” really means :-)

jhatemyjob · 2025-11-17T20:57:51 1763413071

If that's the bar, what language's abstractions haven't failed?

ruined · 2025-11-18T02:22:49 1763432569

wasm and lambda calculus

wat10000 · 2025-11-17T19:32:00 1763407920

If you're targeting a vaguely modern C standard, atomics win by being part of the language. C11 has atomics and it's straightforward to use them to implement thread-safe reference counting.

aidenn0 · 2025-11-17T16:43:10 1763397790

> the shared pointer implementation uses a POSIX mutex

Do you have a source for this? I couldn't find the implementation in TFA nor a link to safe_c.h

layer8 · 2025-11-17T17:32:19 1763400739

The shared-pointer implementation isn’t actually shown (i.e. shared_ptr_copy), and the SharedPtr type doesn’t use a pthread_mutex_t.

kev009 · 2025-11-17T17:13:16 1763399596

C11 has a mutex API (threads.h), so why would it rely on POSIX? Are you sure it's not an runtime detail on one platform? https://devblogs.microsoft.com/cppblog/c11-threads-in-visual...

loeg · 2025-11-17T22:34:42 1763418882

The article has an excerpt using posix mutexes specifically. But you're right that C11 code can just portably use standard mutexes.

  // The old way of manual reference counting
  typedef struct {
      MatchStore* store;
      int ref_count;
      pthread_mutex_t mutex;
  } SharedStore;

kazinator · 2025-11-18T02:09:00 1763431740

ISO C has had mutexes since C11 I think.

In any case, you could use the provided primitives to wrap the C11 mutex, or any other mutex.

With some clever #ifdef, you can probably have a single or multithreaded build switch at compile time which makes all the mutex stuff do nothing.

nurettin · 2025-11-19T03:54:10 1763524450

Rust pays the cumbersome lifetime syntax tax even in provably single threaded contexts. When will Rust develop ergonomics with better defaults and less boilerplate in such contexts?

loeg · 2025-11-17T22:24:27 1763418267

Tecnhically the mutex refcounting example is shown as an example of the before the header the author is talking about. We don't know what they've chosen to implement shared_ptr with.

up2isomorphism · 2025-11-18T07:05:19 1763449519

it is quite obvious which one is easier: type bunch of ifdefs vs learn a new language.

BTW don’t fight C for portability, it is unlikely you will win.

cryptonector · 2025-11-17T22:53:33 1763420013

Meh, it could easily use atomics instead, no lock needed.