Hill climbing a password would only be possible if intermediate KV cache entries were stored. To hillclimb "hunter2", you're going to try "a", "b", "c", etc, until you notice that "h" comes back faster. Then you try "ha", "hb" and so on.
But that's only going to work if the cache looks like: "h", "hu", "hun", ..., "hunter2"
If just "hunter2" is in the cache, you won't get any signal until you stumble on exactly that password. And that's before getting into the block size granularity of the caches discussed elsewhere in this thread.
That's not to say timing attacks aren't possible. I haven't looked at Claude Code's prompt generation, but there's no intrinsic reason why you couldn't do things like figure out what open source code and research papers your competitors are loading into context.
Sharing caches between orgs would be an incredible misstep.
Right, you can’t actually guess a letter (byte) at a time but you can guess a token at a time (I believe the vocabulary is 200000 possible tokens in gpt 5)
So you could send each of the 200000 possible tokens, see which is cached, and then send 200000 more tokens to find the next cached token
Certainly less efficient but well within the realm of a feasible attack
It's a good call out re: tokens vs letters, but I think you might have misunderstood my point - you can't do it a token at a time unless the intermediate KV cache is stored after each token is generated.
This won't be the case in any non toy implementation, as it would be unneccessary and slow.
Ah, fair enough. Anthropic caches at a block level (basically a single message) so for non-trivial messages this is really less of a concern, although I definitely understand why they still scope cache to a single tenant
The author also suggests an incorrect optimization to Tokio here which suggests a lack of understanding of what the specific guarantees given are:
https://github.com/tokio-rs/tokio/pull/7622
The tests do not appear to simulate the queue in Loom, which would be a very, very good idea.
This stuff is hard. I almost certainly made a mistake in what I've written above (edit: I did!). In practice, the queue is probably fine to use, but I wouldn't be shocked if there's a heisenbug lurking in this codebase that manifests something like: it all works fine now, but in the next LLVM version an optimization pass is added which breaks it on ARM in release mode, and after that the queue yields duplicate values in a busy loop every few million reads which is only triggered on Graviton processors.
Or something. Like I said, this stuff is hard. I wrote a very detailed simulator for the Rust/C++ memory model, have implemented dozens of lockless algorithms, and I still make a mistake every time I go to write code. You need to simulate it with something like Loom to have any hope of a robust implementation.
For anyone interested in learning about Rust's memory model, I can't recommend enough Rust Atomics and Locks:
> The tests do not appear to simulate the queue in Loom, which would be a very, very good idea.
Loom is apparently this: https://github.com/tokio-rs/loom I've used tokio a bit in the past, but wasn't aware of that tool at all, looks really useful and probably I'm not alone in never hearing about it before. Any tips&tricks or gotchas with it one should know beforehand?
One of my favourite YouTubers, Matt Orchard, did a video on cults that included Heaven's Gate. He interviews a surviving member. If this article interested you, it's worth a watch (the beginning is a bit silly):
In development, you import Loom's mutex. In production, you import a regular mutex. This of course has zero overhead, but the simulation testing itself is usually quite slow. Only one thread can execute at a time, and many iterations are required.
This is a misreading of their website. On the left, they compare the EPYC 9965 (launched 10/10/24) with the Xeon Platinum 8280 (launched Q2 '19) and make a TCO argument for replacing outdated Intel servers with AMD.
On the right, they compare the EPYC 9965 (launched 10/10/24) with the Xeon Platinum 8592+ (launched Q4 23), a like for like comparison against Intel's competition at launch.
The argument is essentially in two pieces - "If you're upgrading, you should pick AMD. If you're not upgrading, you should be."
It’s true that they compare to different Intel CPUs in different parts of the webpage, and I don’t always understand the intentions behind those comparisons.
Still, if you decode the unreadable footnotes 2 & 3 in the bottom of the page - a few things stand out: avoiding AMX, using CPUs with different core-counts & costs, and even running on a different Linux kernel version, which may affect scheduling…
Ooh, I remember this, but actually the thing is older than it.
First, nVidia and ATI used executable names for detecting games, then they started to add heuristics.
If you think they stopped the practice, you're very mistaken. Every AMD and nVidia driver has game and app specific fixes and optimizations.
nVidia cheated in 3D Mark that way, so they patched/changed their benchmark to prevent it. Also, again nVidia, patched their drivers so some of the more expensive but visually invisible calls like scene flushes in a particular game is batched (e.g. do all 50 flushes at the 50th call) to prevent the game becoming a slide show on expensive hardware.
This is also why AMDs and Intel's open source drivers under Linux a success, because they are vanilla drivers written from scratch per spec, and if your code calls OpenGL/Vulkan to spec, then you're golden.
Even some companies cross compile AMD's Linux drivers for windows on embedded systems since they're free from useless optimizations from them.
Interestingly, most benchmark controversies back in the day are now expected behaviour, i.e. game-specific optimizations with no (well, in this age of upscalers and other lossy optimization techniques, probably even somewhat) visible image degradation. A gaming-specific driver with no game-specific improvements in its changelog would be considered strange, and it very much works with executable detection.
Back in the day, there was still the argument that drivers should not optimize for benchmarks even when visually identical, because it wouldn't show the hardware's real world potential. Kinda cute from today's perspective. :)
But of course there were the obvious cases...
The Quack3 lowering filtering quality as shown above, of course (at least that one was put into the driver as a togglable setting later on).
But the most cheeky one has to be nVidia's 3dmark03 "optimizations", where they blatantly put static clip planes into the scenes so that everything outside the predefined camera path from the benchmark sequence would simply be cut from the scene early (which e.g. fully broke the freelook patched into 3dmark and would generally break any interactive application)
I think that was the first case (to go public), but I remember reading about this in game magazines a couple times after this, for both ATI and nvidia.
But that's only going to work if the cache looks like: "h", "hu", "hun", ..., "hunter2"
If just "hunter2" is in the cache, you won't get any signal until you stumble on exactly that password. And that's before getting into the block size granularity of the caches discussed elsewhere in this thread.
That's not to say timing attacks aren't possible. I haven't looked at Claude Code's prompt generation, but there's no intrinsic reason why you couldn't do things like figure out what open source code and research papers your competitors are loading into context.
Sharing caches between orgs would be an incredible misstep.
reply