reitzensteinm · 2025-12-19T21:10:02 1766178602

Hill climbing a password would only be possible if intermediate KV cache entries were stored. To hillclimb "hunter2", you're going to try "a", "b", "c", etc, until you notice that "h" comes back faster. Then you try "ha", "hb" and so on.

But that's only going to work if the cache looks like: "h", "hu", "hun", ..., "hunter2"

If just "hunter2" is in the cache, you won't get any signal until you stumble on exactly that password. And that's before getting into the block size granularity of the caches discussed elsewhere in this thread.
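A toy sketch of that difference, under loud assumptions: the "cache" is just a set of ASCII strings, a "faster response" is modelled as a longer cached prefix, and the names `per_prefix_cache`, `whole_entry_cache`, `cached_prefix_len` and `hill_climb` are made up for illustration. It is not how any real inference server stores KV entries.

```rust
use std::collections::HashSet;

/// Candidate characters the attacker tries (assumed: lowercase ASCII and digits).
const ALPHABET: &str = "abcdefghijklmnopqrstuvwxyz0123456789";

/// Hypothetical cache that stores every intermediate prefix: "h", "hu", ..., "hunter2".
fn per_prefix_cache(secret: &str) -> HashSet<String> {
    (1..=secret.len()).map(|n| secret[..n].to_string()).collect()
}

/// Hypothetical cache that stores only the complete entry.
fn whole_entry_cache(secret: &str) -> HashSet<String> {
    let mut cache = HashSet::new();
    cache.insert(secret.to_string());
    cache
}

/// Stand-in for the timing signal: length of the longest cached prefix of `probe`.
/// Assumes ASCII input so byte slicing is safe.
fn cached_prefix_len(cache: &HashSet<String>, probe: &str) -> usize {
    (1..=probe.len())
        .rev()
        .find(|&n| cache.contains(&probe[..n]))
        .unwrap_or(0)
}

/// Extend the guess one character at a time while some candidate "comes back faster".
fn hill_climb(cache: &HashSet<String>, max_len: usize) -> String {
    let mut known = String::new();
    while known.len() < max_len {
        let next = ALPHABET.chars().find(|&c| {
            let probe = format!("{}{}", known, c);
            // A longer cache hit than before is the attacker's only signal.
            cached_prefix_len(cache, &probe) > known.len()
        });
        match next {
            Some(c) => known.push(c),
            None => break, // no candidate was faster: the climb stalls here
        }
    }
    known
}
```

With `per_prefix_cache("hunter2")` the climb recovers the whole string. With `whole_entry_cache("hunter2")` it stalls on the first character and returns an empty guess, because no single-character probe ever hits.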

That's not to say timing attacks aren't possible. I haven't looked at Claude Code's prompt generation, but there's no intrinsic reason why you couldn't do things like figure out what open source code and research papers your competitors are loading into context.

Sharing caches between orgs would be an incredible misstep.

jgeralnik · 2025-12-19T22:15:18 1766182518

Right, you can’t actually guess a letter (byte) at a time, but you can guess a token at a time (I believe the vocabulary is 200000 possible tokens in gpt 5). So you could send each of the 200000 possible tokens, see which is cached, and then send 200000 more tokens to find the next cached token. Certainly less efficient, but well within the realm of a feasible attack.

reitzensteinm · 2025-12-19T23:28:13 1766186893

It's a good call out re: tokens vs letters, but I think you might have misunderstood my point - you can't do it a token at a time unless the intermediate KV cache is stored after each token is generated.

This won't be the case in any non-toy implementation, as it would be unnecessary and slow.

jgeralnik · 2025-12-20T05:51:16 1766209876

Ah, fair enough. Anthropic caches at a block level (basically a single message) so for non-trivial messages this is really less of a concern, although I definitely understand why they still scope cache to a single tenant
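Back-of-the-envelope arithmetic for why block granularity matters, as a sketch only. It assumes the 200000-token vocabulary guessed upthread, that only complete blocks are cached, and that each block has to be guessed as a unit. The function name `worst_case_probes` and the block sizes in the usage note are hypothetical, not anyone's real configuration.

```rust
/// Worst-case number of probes to recover a cached secret one cache block at a time.
/// Returns `None` if the block size is zero or the count overflows `u128`.
fn worst_case_probes(vocab: u128, secret_tokens: u32, block_tokens: u32) -> Option<u128> {
    if block_tokens == 0 {
        return None;
    }
    // Only complete blocks are cached in this model, so only they leak timing.
    let full_blocks = u128::from(secret_tokens / block_tokens);
    // A block is all-or-nothing: vocab^block_tokens candidates per block.
    let per_block = vocab.checked_pow(block_tokens)?;
    full_blocks.checked_mul(per_block)
}
```

With a per-token cache (`block_tokens = 1`) a 7-token secret costs at most 1,400,000 probes. With 2-token blocks an 8-token secret already costs 160,000,000,000, and with 16-token blocks the count no longer fits in a `u128`.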

reitzensteinm · 2025-11-02T14:56:26 1762095386

I'm a little nervous about the correctness of the memory orderings in this project, e.g.

Two acquires back to back are unnecessary here. In general, fetch_sub and fetch_add should give enough guarantees for this file in Relaxed. https://github.com/frostyplanet/crossfire-rs/blob/master/src...

Congest is never written to with release, so the Acquire is never used to form a release chain: https://github.com/frostyplanet/crossfire-rs/blob/dd4a646ca9...

The queue appears to close the channel twice (once per rx/tx), which is discordant with the apparent care taken with the fencing. https://github.com/frostyplanet/crossfire-rs/blob/dd4a646ca9...

The author also suggests an incorrect optimization to Tokio here which suggests a lack of understanding of what the specific guarantees given are: https://github.com/tokio-rs/tokio/pull/7622

The tests do not appear to simulate the queue in Loom, which would be a very, very good idea.
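For readers unfamiliar with the orderings mentioned above, here is a generic sketch using only `std`. It is not taken from crossfire and makes no claim about that codebase; `relaxed_count` and `publish_and_read` are made-up names. The first function shows that a Relaxed `fetch_add` is still atomic, so no increments are lost. The second shows the pairing that gives an Acquire load its meaning: it only synchronizes with a Release store to the same atomic.

```rust
use std::hint;
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Relaxed read-modify-write operations are still atomic: every increment is counted.
fn relaxed_count(threads: usize, per_thread: usize) -> usize {
    let counter = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..per_thread {
                    counter.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for handle in handles {
        // `join` establishes happens-before, so the final load sees all increments.
        handle.join().unwrap();
    }
    counter.load(Ordering::Relaxed)
}

/// Message passing: the Release store on `ready` publishes the earlier write to `data`,
/// and the Acquire load that observes `true` makes that write visible to the reader.
fn publish_and_read() -> usize {
    let data = Arc::new(AtomicUsize::new(0));
    let ready = Arc::new(AtomicBool::new(false));

    let producer = {
        let (data, ready) = (Arc::clone(&data), Arc::clone(&ready));
        thread::spawn(move || {
            data.store(42, Ordering::Relaxed);
            ready.store(true, Ordering::Release);
        })
    };

    // Spin until the flag is observed; without a matching Release store,
    // this Acquire would order nothing.
    while !ready.load(Ordering::Acquire) {
        hint::spin_loop();
    }
    let value = data.load(Ordering::Relaxed);
    producer.join().unwrap();
    value
}
```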

This stuff is hard. I almost certainly made a mistake in what I've written above (edit: I did!). In practice, the queue is probably fine to use, but I wouldn't be shocked if there's a heisenbug lurking in this codebase that manifests something like: it all works fine now, but in the next LLVM version an optimization pass is added which breaks it on ARM in release mode, and after that the queue yields duplicate values in a busy loop every few million reads which is only triggered on Graviton processors.

Or something. Like I said, this stuff is hard. I wrote a very detailed simulator for the Rust/C++ memory model, have implemented dozens of lockless algorithms, and I still make a mistake every time I go to write code. You need to simulate it with something like Loom to have any hope of a robust implementation.

For anyone interested in learning about Rust's memory model, I can't recommend enough Rust Atomics and Locks:

https://marabos.nl/atomics/

embedding-shape · 2025-11-02T18:12:59 1762107179

> The tests do not appear to simulate the queue in Loom, which would be a very, very good idea.

Loom is apparently this: https://github.com/tokio-rs/loom

I've used tokio a bit in the past but wasn't aware of that tool at all. It looks really useful, and I'm probably not alone in never having heard of it before. Any tips & tricks or gotchas one should know beforehand?

reitzensteinm · 2025-10-15T07:24:27 1760513067

I'm not going to thumb my nose at CPU design content from folks that aren't good at public speaking. They're almost entirely distinct skill sets.

actionfromafar · 2025-10-15T08:53:43 1760518423

Also, the Venn Diagram between (good public speech) and (good public speech which also looks good when transcribed) is probably pretty thin.

reitzensteinm · 2025-09-28T02:37:57 1759027077

It's too early to tell.

reitzensteinm · 2025-09-09T02:51:50 1757386310

This comment is a nugget of gold - I hadn't thought about it in those terms before but it makes total sense. Thank you!

reitzensteinm · 2025-08-12T22:52:07 1755039127

One of my favourite YouTubers, Matt Orchard, did a video on cults that included Heaven's Gate. He interviews a surviving member. If this article interested you, it's worth a watch (the beginning is a bit silly):

https://www.youtube.com/watch?v=L9F-vb7s3DE

reitzensteinm · 2025-08-10T23:10:29 1754867429

OpenAI automatically caches prompt prefixes on the API. Caching an infrequently changing internally controlled system prompt is trivial by comparison.

reitzensteinm · 2025-08-05T08:12:56 1754381576

See Tokio's Loom as an example: https://github.com/tokio-rs/loom

In development, you import Loom's mutex. In production, you import a regular mutex. This of course has zero overhead, but the simulation testing itself is usually quite slow. Only one thread can execute at a time, and many iterations are required.
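A minimal sketch of that import swap, assuming the usual convention of building tests with `RUSTFLAGS="--cfg loom"`. The function `two_increments` and the test name are hypothetical. Without the flag, the `loom` lines are compiled out and the code uses plain `std` types.

```rust
// Under `--cfg loom`, use Loom's instrumented types; otherwise fall back to std.
#[cfg(loom)]
use loom::sync::{Arc, Mutex};
#[cfg(loom)]
use loom::thread;

#[cfg(not(loom))]
use std::sync::{Arc, Mutex};
#[cfg(not(loom))]
use std::thread;

/// Two threads each add one to a mutex-protected counter.
fn two_increments() -> usize {
    let counter = Arc::new(Mutex::new(0usize));
    let handles: Vec<_> = (0..2)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                *counter.lock().unwrap() += 1;
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    // Bind the value first so the lock guard is dropped before `counter` is.
    let total = *counter.lock().unwrap();
    total
}

// Compiled only for Loom runs: `loom::model` re-runs the closure under many
// thread interleavings, which is why simulation is slow.
#[cfg(loom)]
#[test]
fn loom_two_increments() {
    loom::model(|| assert_eq!(two_increments(), 2));
}
```

The production build never references Loom, which is where the zero overhead comes from.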

reitzensteinm · 2025-07-27T01:43:54 1753580634

This is a misreading of their website. On the left, they compare the EPYC 9965 (launched 10/10/24) with the Xeon Platinum 8280 (launched Q2 '19) and make a TCO argument for replacing outdated Intel servers with AMD.

On the right, they compare the EPYC 9965 (launched 10/10/24) with the Xeon Platinum 8592+ (launched Q4 '23), a like for like comparison against Intel's competition at launch.

The argument is essentially in two pieces - "If you're upgrading, you should pick AMD. If you're not upgrading, you should be."

ashvardanian · 2025-07-27T09:09:08 1753607348

It’s true that they compare to different Intel CPUs in different parts of the webpage, and I don’t always understand the intentions behind those comparisons.

Still, if you decode the unreadable footnotes 2 & 3 at the bottom of the page, a few things stand out: avoiding AMX, using CPUs with different core-counts & costs, and even running on a different Linux kernel version, which may affect scheduling…

reitzensteinm · 2025-07-11T11:54:48 1752234888

Or maybe Quack III: Arena. https://m.slashdot.org/story/21054

bayindirh · 2025-07-11T13:36:30 1752240990

Ooh, I remember this, but actually the thing is older than it.

First, nVidia and ATI used executable names for detecting games, then they started to add heuristics.

If you think they stopped the practice, you're very mistaken. Every AMD and nVidia driver has game and app specific fixes and optimizations.

nVidia cheated in 3D Mark that way, so the benchmark was patched/changed to prevent it. Also, nVidia again: they patched their drivers so that some of the more expensive but visually invisible calls in a particular game, like scene flushes, are batched (e.g. do all 50 flushes at the 50th call) to prevent the game from becoming a slide show on expensive hardware.

This is also why AMD's and Intel's open source drivers under Linux are a success: they are vanilla drivers written from scratch per spec, and if your code calls OpenGL/Vulkan to spec, then you're golden.

Some companies even cross-compile AMD's Linux drivers for Windows on embedded systems, since those drivers are free of such useless optimizations.

dahauns · 2025-07-11T13:32:38 1752240758

Aah, that brings back memories...

Interestingly, most benchmark controversies back in the day are now expected behaviour, i.e. game-specific optimizations with no (well, in this age of upscalers and other lossy optimization techniques, probably even somewhat) visible image degradation. A gaming-specific driver with no game-specific improvements in its changelog would be considered strange, and it very much works with executable detection.

Back in the day, there was still the argument that drivers should not optimize for benchmarks even when visually identical, because it wouldn't show the hardware's real world potential. Kinda cute from today's perspective. :)

But of course there were the obvious cases...

The Quack3 lowering filtering quality as shown above, of course (at least that one was put into the driver as a togglable setting later on).

But the most cheeky one has to be nVidia's 3dmark03 "optimizations", where they blatantly put static clip planes into the scenes so that everything outside the predefined camera path from the benchmark sequence would simply be cut from the scene early (which e.g. fully broke the freelook patched into 3dmark and would generally break any interactive application)

bayindirh · 2025-07-11T13:37:23 1752241043

You beat me to it. Grrr...

Just kidding, nice to see another person who remembers these things. Want some root beer?

BoredPositron · 2025-07-11T12:14:22 1752236062

Now I want a Quake shooter but with ducks.

carlos22 · 2025-07-11T13:01:21 1752238881

Not ducks, but chickens, was very popular in Germany back in the day: https://en.wikipedia.org/wiki/Crazy_Chicken

avhception · 2025-07-11T14:27:19 1752244039

Oh wow, that was a blast from the past. The Moorhuhn craze!

Many people, including me, didn't have an internet connection back in the day. The Sneakernet went into overdrive to get everyone a copy!

supportengineer · 2025-07-11T14:39:50 1752244790

A Duck Hunt, if you will…

iforgotpassword · 2025-07-11T12:01:54 1752235314

I think that was the first case (to go public), but I remember reading about this in game magazines a couple times after this, for both ATI and nvidia.