I'd absolutely love to play with this. One idea I had is to train another model to create bitmaps of sidewalks and roads and add a simulation for pedestrians and cars. A day/night cycle would also be so cool!
After watching this video, my first thought was whether recent results from columnar compression (e.g. https://docs.vortex.dev/references#id1) applied "naively" like QOI would have good results.
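For concreteness, here's a minimal sketch of one possible "naive columnar" reading: split RGBA into per-channel planes, delta-encode each, and run a generic compressor over each plane. The actual Vortex encodings are more sophisticated, so the choice of zlib and delta coding here is just my stand-in for the shape of the experiment.

```python
# Sketch: "naive" columnar compression of an RGBA image.
# Assumptions (mine, not from the Vortex docs): columns = color channels,
# per-column delta encoding, zlib as a stand-in general-purpose codec.
import zlib
import numpy as np
from PIL import Image

def columnar_compressed_size(path: str) -> int:
    pixels = np.asarray(Image.open(path).convert("RGBA"))   # (H, W, 4) uint8
    total = 0
    for c in range(4):                                       # one "column" per channel
        plane = pixels[..., c].ravel()
        delta = np.diff(plane, prepend=0).astype(np.uint8)   # reversible mod-256 delta
        total += len(zlib.compress(delta.tobytes(), level=9))
    return total

# print(columnar_compressed_size("sprites.png"), "bytes after columnar + zlib")
```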
I started with a 1.79MiB sprite file for a 2D game I've been hacking on, and here are the results:
You are deeply misunderstanding what the KV cache referred to here is. It's not for storing data. This is the KV cache that's part of the model, used to reduce the quadratic compute complexity of self-attention to linear. It's not stored on SSD; it's in VRAM (or CPU RAM if you're not using a GPU).
They do, in fact, mention the inference KV cache as a use case in the README. The most advanced KV caching uses a hierarchy of GPU RAM / regular RAM / SSD; it seems they were able to use their storage abstraction for the last tier.
KVCache is a technique used to optimize the LLM inference process. It avoids redundant computations by caching the key and value vectors of previous tokens in the decoder layers. The figure in their README shows the read throughput of all KVCache clients (1×400 Gbps NIC per node), highlighting both peak and average values, with peak throughput reaching up to 40 GiB/s.
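For anyone unfamiliar, here is a minimal sketch of what the cache actually holds during decoding (single attention head, NumPy, illustrative shapes only, no real model):

```python
# Minimal single-head decode step with a KV cache (illustrative only).
import numpy as np

d = 64                                    # head dimension (made up)
k_cache = np.empty((0, d))                # cached keys of all previous tokens
v_cache = np.empty((0, d))                # cached values of all previous tokens

def decode_step(x, Wq, Wk, Wv):
    """Process one new token, reusing K/V of earlier tokens instead of recomputing them."""
    global k_cache, v_cache
    q, k, v = x @ Wq, x @ Wk, x @ Wv      # only the new token gets projected
    k_cache = np.vstack([k_cache, k])     # cache grows linearly with context length
    v_cache = np.vstack([v_cache, v])
    scores = (k_cache @ q) / np.sqrt(d)   # attend over every cached position
    weights = np.exp(scores - scores.max()); weights /= weights.sum()
    return weights @ v_cache              # attention output for the new token

# usage: x = np.random.randn(d); Wq = Wk = Wv = np.eye(d); out = decode_step(x, Wq, Wk, Wv)
```

Without the cache you would re-project K and V for every previous token on every step (quadratic work over the whole generation); with it, each step only attends over the stored blob, which is the "quadratic to linear" point made upthread.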
That's because DeepSeek uses MLA which apparently does allow offloading the KV cache. That doesn't apply to all models, particularly the open-weight models that are primarily GQA AFAIK.
Any model allows offloading the KV cache; it's not a matter of model architecture, only of the implementation. The only real difference is for non-transformer models. For all attention models it's the same: a blob of data per token. It's much worse for older models with MHA because their KV cache is just too big, and it's best for DeepSeek because their KV cache is the smallest, but it's fine for the current generation of GQA models as well.
Are you sure about that? GQA applies self-attention to every KV cache entry. If you're offloading, then you have to dynamically page all the KV cache entries back into the GPU, which is quite slow since the CPU/GPU link only has so much bandwidth. My understanding is that MLA reduces the size of the KV cache & doesn't necessarily attend to every KV token at every step, which is why offloading to disk works (i.e. most of the tokens can remain on disk without ever being loaded into the GPU).
Offloading in this case doesn't mean keeping the KV cache on disk/in storage all the time; it means keeping it there when a request isn't in the process of generation. While a request is being generated, its KV cache is indeed in VRAM.
As for MLA: DeepSeek, just like the others, attends to all historical tokens. The only difference is that instead of storing actual KV entries it stores lower-dimensional latent entries, which are projected into full-blown KV entries on the fly during attention. It's similar to GQA, except that instead of duplicating KV entries by group size it applies a linear transformation.
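A rough back-of-envelope of why the per-token sizes differ so much. The layer counts and dimensions below are illustrative stand-ins (the MLA line is DeepSeek-V3-ish: ~61 layers, 512-dim latent plus 64-dim decoupled RoPE key), not exact published configs:

```python
# Rough per-token KV cache size in fp16 (2 bytes/element). Illustrative numbers only.
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per=2):
    return layers * kv_heads * head_dim * 2 * bytes_per      # *2 for K and V

mha = kv_bytes_per_token(layers=80, kv_heads=64, head_dim=128)  # MHA: every head stores K/V
gqa = kv_bytes_per_token(layers=80, kv_heads=8,  head_dim=128)  # GQA: heads share 8 KV groups
mla = 61 * (512 + 64) * 2                                       # MLA: one compressed latent + RoPE key per layer
print(round(mha/1024), round(gqa/1024), round(mla/1024), "KiB per token")  # ~2560, 320, 69 KiB
```

Which is roughly why MHA-era models are painful to offload, MLA is comfortable, and GQA sits in between but is workable.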
Ah OK. So this is for resuming chat context cheaply. What I said is still correct - 3FS is not part of the inference flow & not relevant to the paper which is about optimizing the KV cache usage at runtime.
this looks to be solving a different problem than A*, which operates over discrete graphs; it operates in continuous 2D space instead.
so, what is the algorithm for finding the optimal point on the obstacle's outline for bypass (4)? is it finding the point on the outline nearest the destination?
then, how do you subsequently "backtrack" to a different bypass point on the obstacle if the first choice of bypass point doesn't work out?
there's something interesting here for trying to directly operate on 2D space rather than discretizing it into a graph, but I'm curious how the details shake out.
The algorithm for finding detour points is as follows (a rough code sketch is included after the list).
In fact, I’ve improved it a bit through research:
1. Detect a collision with an obstacle on the straight path connecting the starting point and the destination.
2. Decide which direction to explore along the obstacle's outline (for now, the side closer to the destination).
3. If the end of the visible outline is reached, search for an appropriate detour point around that outline.
4. Select a detour point where a straight-line movement from the starting point avoids the obstacle, preferably closer to the destination.
---
If the first detour point selection fails, I plan to search in the *opposite direction* along the outline where the obstacle was first encountered.
I’m currently working on resolving this part.
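To make steps 1-4 concrete, here's a rough sketch for a single convex polygon obstacle. This is my own simplification (strict segment crossing test, vertices as candidate detour points), not the author's implementation, which presumably handles arbitrary outlines and the backtracking case:

```python
# Sketch of steps 1-4 for one convex polygon obstacle (simplified, not the author's code).
def cross(o, a, b):
    return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])

def segments_intersect(p1, p2, q1, q2):
    d1, d2 = cross(q1, q2, p1), cross(q1, q2, p2)
    d3, d4 = cross(p1, p2, q1), cross(p1, p2, q2)
    return d1*d2 < 0 and d3*d4 < 0          # strict crossing; touching an endpoint doesn't count

def blocked(a, b, outline):
    edges = zip(outline, outline[1:] + outline[:1])
    return any(segments_intersect(a, b, e0, e1) for e0, e1 in edges)

def find_detour(start, goal, outline):
    # Step 1: collision check on the straight start->goal segment.
    if not blocked(start, goal, outline):
        return goal
    # Steps 2-4: among outline vertices reachable in a straight line from start,
    # prefer the one closest to the goal.
    dist = lambda a, b: ((a[0]-b[0])**2 + (a[1]-b[1])**2) ** 0.5
    reachable = [v for v in outline if not blocked(start, v, outline)]
    return min(reachable, key=lambda v: dist(v, goal)) if reachable else None
```

In practice you'd nudge the chosen vertex slightly outward from the outline and recurse from the detour point toward the goal; the backtracking mentioned above would walk the outline in the opposite direction when no reachable detour point works out.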
this is unbelievably cool. ~27ns overhead for searching for a u32 in a 4GB set in memory is unreal.
it's interesting that the wins for batching start diminishing at 8. I'm curious then how the subsequent optimizations fare with batch size 8 (rather than 128).
smaller batch sizes are nice since they determine how much request throughput we'd need to saturate this system. at batch size 8, we need 1s / ~30ns * 8 = ~266M searches per second to fully utilize this algorithm.
the multithreading results are also interesting -- going from 1 to 6 threads only reduces the overhead by about 4x. curious how this fares on a much higher core count machine.
Just fyi: the throughput numbers with batching are per _query_, not per _batch_, so I think the *8 is too optimistic :)
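In other words, under the per-query reading the saturation target is roughly 33M queries/s rather than ~266M. A quick check of the two readings (assuming the ~30 ns figure is amortized per query):

```python
# Two readings of "~30 ns with batch size 8" (assumed amortized figure).
ns_per_query = 30e-9
per_query_reading = 1 / ns_per_query       # ~33M queries/s needed if 30 ns is per query
per_batch_reading = per_query_reading * 8  # ~267M/s, the (too optimistic) per-batch reading
print(f"{per_query_reading/1e6:.0f}M vs {per_batch_reading/1e6:.0f}M queries/s")
```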
I suspect that at higher core counts, we can still saturate the full RAM bandwidth with only 4-5 cores, so that the marginal gains with additional cores will be very small.
That's good though, because that gives CPU time to work on the bigger problem to determine the right queries, and to deal with the outputs (as long as that is not too memory bound in itself, although it probably is).
I love playing with it at UltraHigh quality and 1 solver iteration. It reminds me of gradually incorporating one ingredient into another when cooking: like incorporating flour into eggs when making pasta.