I'd absolutely love to play with this. One idea I had is to train another model to create bitmaps of sidewalks and roads and add a simulation for pedestrians and cars. A day/night cycle would also be so cool!
After watching this video, my first thought was whether recent results from columnar compression (e.g. https://docs.vortex.dev/references#id1) applied "naively" like QOI would have good results.
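For concreteness, here's a minimal sketch of one possible "naive columnar" reading: split RGBA into per-channel planes, delta-encode each, and run a generic compressor over each plane. The actual Vortex encodings are more sophisticated, so the choice of zlib and delta coding here is just my stand-in for the shape of the experiment.

```python
# Sketch: "naive" columnar compression of an RGBA image.
# Assumptions (mine, not from the Vortex docs): columns = color channels,
# per-column delta encoding, zlib as a stand-in general-purpose codec.
import zlib
import numpy as np
from PIL import Image

def columnar_compressed_size(path: str) -> int:
    pixels = np.asarray(Image.open(path).convert("RGBA"))   # (H, W, 4) uint8
    total = 0
    for c in range(4):                                       # one "column" per channel
        plane = pixels[..., c].ravel()
        delta = np.diff(plane, prepend=0).astype(np.uint8)   # reversible mod-256 delta
        total += len(zlib.compress(delta.tobytes(), level=9))
    return total

# print(columnar_compressed_size("sprites.png"), "bytes after columnar + zlib")
```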
I started with a 1.79MiB sprite file for a 2D game I've been hacking on, and here are the results:
You are deeply misunderstanding what the KV cache referred to here is. It's not for storing data. This is the KV cache that's part of the model, used to reduce the quadratic compute complexity of self-attention to linear. It's not stored on SSD; it's in VRAM (or CPU RAM if you're not using a GPU).
They do, in fact, mention the inference KV cache as a use case in the README. The most advanced KV caching uses a hierarchy of GPU RAM / regular RAM / SSD; it seems they were able to use their storage abstraction for the last tier.
KVCache is a technique used to optimize the LLM inference process. It avoids redundant computations by caching the key and value vectors of previous tokens in the decoder layers. The figure in their README shows the read throughput of all KVCache clients (1×400 Gbps NIC per node), highlighting both peak and average values, with peak throughput reaching up to 40 GiB/s.
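For anyone unfamiliar, here is a minimal sketch of what the cache actually holds during decoding (single attention head, NumPy, illustrative shapes only, no real model):

```python
# Minimal single-head decode step with a KV cache (illustrative only).
import numpy as np

d = 64                                    # head dimension (made up)
k_cache = np.empty((0, d))                # cached keys of all previous tokens
v_cache = np.empty((0, d))                # cached values of all previous tokens

def decode_step(x, Wq, Wk, Wv):
    """Process one new token, reusing K/V of earlier tokens instead of recomputing them."""
    global k_cache, v_cache
    q, k, v = x @ Wq, x @ Wk, x @ Wv      # only the new token gets projected
    k_cache = np.vstack([k_cache, k])     # cache grows linearly with context length
    v_cache = np.vstack([v_cache, v])
    scores = (k_cache @ q) / np.sqrt(d)   # attend over every cached position
    weights = np.exp(scores - scores.max()); weights /= weights.sum()
    return weights @ v_cache              # attention output for the new token

# usage: x = np.random.randn(d); Wq = Wk = Wv = np.eye(d); out = decode_step(x, Wq, Wk, Wv)
```

Without the cache you would re-project K and V for every previous token on every step (quadratic work over the whole generation); with it, each step only attends over the stored blob, which is the "quadratic to linear" point made upthread.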
That's because DeepSeek uses MLA which apparently does allow offloading the KV cache. That doesn't apply to all models, particularly the open-weight models that are primarily GQA AFAIK.
Any model allows offloading the KV cache; it's not a matter of model architecture, only of the implementation. The only real difference is for non-transformer models. For all attention models it's the same: a blob of data per token. It's much worse for older models with MHA because their KV cache is just too big, and it's best for DeepSeek because their KV cache is the smallest, but it's fine for the current generation of GQA models as well.
Are you sure about that? GQA applies self-attention to every KV cache entry. If you're offloading, then you have to dynamically page all the KV cache entries back into the GPU, which is quite slow since the CPU/GPU link only has so much bandwidth. My understanding is that MLA reduces the size of the KV cache & doesn't necessarily attend to every KV token at every step, which is why offloading to disk works (i.e. most of the tokens can remain on disk without ever being loaded into the GPU).
Offloading in this case doesn't mean keeping the KV cache on disk/in storage all the time; it means keeping it there when a request isn't in the process of generation. While a request is being generated, its KV cache is indeed in VRAM.
As for MLA: DeepSeek, just like the others, attends to all historical tokens. The only difference is that instead of storing actual KV entries it stores lower-dimensional latent entries, which are projected into full-blown KV entries on the fly during attention. It's similar to GQA, except that instead of duplicating KV entries by group size it applies a linear transformation.
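A rough back-of-envelope of why the per-token sizes differ so much. The layer counts and dimensions below are illustrative stand-ins (the MLA line is DeepSeek-V3-ish: ~61 layers, 512-dim latent plus 64-dim decoupled RoPE key), not exact published configs:

```python
# Rough per-token KV cache size in fp16 (2 bytes/element). Illustrative numbers only.
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per=2):
    return layers * kv_heads * head_dim * 2 * bytes_per      # *2 for K and V

mha = kv_bytes_per_token(layers=80, kv_heads=64, head_dim=128)  # MHA: every head stores K/V
gqa = kv_bytes_per_token(layers=80, kv_heads=8,  head_dim=128)  # GQA: heads share 8 KV groups
mla = 61 * (512 + 64) * 2                                       # MLA: one compressed latent + RoPE key per layer
print(round(mha/1024), round(gqa/1024), round(mla/1024), "KiB per token")  # ~2560, 320, 69 KiB
```

Which is roughly why MHA-era models are painful to offload, MLA is comfortable, and GQA sits in between but is workable.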
Ah OK. So this is for resuming chat context cheaply. What I said is still correct - 3FS is not part of the inference flow & not relevant to the paper which is about optimizing the KV cache usage at runtime.
this looks to be solving a different problem than A*, which operates over discrete graphs; it operates in continuous 2D space instead.
so, what is the algorithm for finding the optimal point on the obstacle's outline for bypass (4)? is it finding the point on the outline nearest the destination?
then, how do you subsequently "backtrack" to a different bypass point on the obstacle if the first choice of bypass point doesn't work out?
there's something interesting here for trying to directly operate on 2D space rather than discretizing it into a graph, but I'm curious how the details shake out.
The algorithm for finding detour points is as follows (a rough code sketch is included after the list).
In fact, I’ve improved it a bit through research:
1. Detect a collision with an obstacle on the straight path connecting the starting point and the destination.
2. Decide which direction to explore along the obstacle's outline (for now, the side closer to the destination).
3. If the end of the visible outline is reached, search for an appropriate detour point around that outline.
4. Select a detour point where a straight-line movement from the starting point avoids the obstacle, preferably closer to the destination.
---
If the first detour point selection fails, I plan to search in the *opposite direction* along the outline where the obstacle was first encountered.
I’m currently working on resolving this part.
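To make steps 1-4 concrete, here's a rough sketch for a single convex polygon obstacle. This is my own simplification (strict segment crossing test, vertices as candidate detour points), not the author's implementation, which presumably handles arbitrary outlines and the backtracking case:

```python
# Sketch of steps 1-4 for one convex polygon obstacle (simplified, not the author's code).
def cross(o, a, b):
    return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])

def segments_intersect(p1, p2, q1, q2):
    d1, d2 = cross(q1, q2, p1), cross(q1, q2, p2)
    d3, d4 = cross(p1, p2, q1), cross(p1, p2, q2)
    return d1*d2 < 0 and d3*d4 < 0          # strict crossing; touching an endpoint doesn't count

def blocked(a, b, outline):
    edges = zip(outline, outline[1:] + outline[:1])
    return any(segments_intersect(a, b, e0, e1) for e0, e1 in edges)

def find_detour(start, goal, outline):
    # Step 1: collision check on the straight start->goal segment.
    if not blocked(start, goal, outline):
        return goal
    # Steps 2-4: among outline vertices reachable in a straight line from start,
    # prefer the one closest to the goal.
    dist = lambda a, b: ((a[0]-b[0])**2 + (a[1]-b[1])**2) ** 0.5
    reachable = [v for v in outline if not blocked(start, v, outline)]
    return min(reachable, key=lambda v: dist(v, goal)) if reachable else None
```

In practice you'd nudge the chosen vertex slightly outward from the outline and recurse from the detour point toward the goal; the backtracking mentioned above would walk the outline in the opposite direction when no reachable detour point works out.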
this is unbelievably cool. ~27ns overhead for searching for a u32 in a 4GB set in memory is unreal.
it's interesting that the wins for batching start diminishing at 8. I'm curious then how the subsequent optimizations fare with batch size 8 (rather than 128).
smaller batch sizes are nice since they determine how much request throughput we'd need to saturate this system. at batch size 8, we need 1s / ~30ns * 8 = ~266M searches per second to fully utilize this algorithm.
the multithreading results are also interesting -- going from 1 to 6 threads only reduces the overhead by about 4x. curious how this fares on a much higher core count machine.
Just fyi: the throughput numbers with batching are per _query_, not per _batch_, so I think the *8 is too optimistic :)
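In other words, under the per-query reading the saturation target is roughly 33M queries/s rather than ~266M. A quick check of the two readings (assuming the ~30 ns figure is amortized per query):

```python
# Two readings of "~30 ns with batch size 8" (assumed amortized figure).
ns_per_query = 30e-9
per_query_reading = 1 / ns_per_query       # ~33M queries/s needed if 30 ns is per query
per_batch_reading = per_query_reading * 8  # ~267M/s, the (too optimistic) per-batch reading
print(f"{per_query_reading/1e6:.0f}M vs {per_batch_reading/1e6:.0f}M queries/s")
```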
I suspect that at higher core counts, we can still saturate the full RAM bandwidth with only 4-5 cores, so that the marginal gains with additional cores will be very small.
That's good though, because that gives CPU time to work on the bigger problem to determine the right queries, and to deal with the outputs (as long as that is not too memory bound in itself, although it probably is).
I love playing with it at UltraHigh quality and 1 solver iteration. It reminds me of gradually incorporating one ingredient into another when cooking: like incorporating flour into eggs when making pasta.