
here's a quick recording from the 20b model on my 128GB M4 Max MBP: https://asciinema.org/a/AiLDq7qPvgdAR1JuQhvZScMNr

and the 120b: https://asciinema.org/a/B0q8tBl7IcgUorZsphQbbZsMM

I am, um, floored



Generation is usually fast, but prompt processing is the main limitation with local agents. I also have a 128 GB M4 Max. How is the prompt processing on long prompts? Processing the system prompt for Goose always takes quite a while for me. I haven't been able to download the 120B yet, but I'm looking to switch to either that or GLM-4.5-Air as my main driver.


Here's a sample of running the 120B model with Ollama on my MBP:

```
total duration:       1m14.16469975s
load duration:        56.678959ms
prompt eval count:    3921 token(s)
prompt eval duration: 10.791402416s
prompt eval rate:     363.34 tokens/s
eval count:           2479 token(s)
eval duration:        1m3.284597459s
eval rate:            39.17 tokens/s
```
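
If you'd rather pull those numbers programmatically than eyeball the CLI output, something like this works against a local Ollama server. This is just a sketch: it assumes the default port 11434, the nanosecond timing fields Ollama returns when streaming is disabled, and whatever model tag `ollama list` shows on your machine.

```
# Query a local Ollama server and derive the same prompt-eval / eval rates.
import json
import urllib.request

payload = {
    "model": "gpt-oss:120b",   # use whatever tag `ollama list` shows locally
    "prompt": "Summarize the attention mechanism in three sentences.",
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    stats = json.load(resp)

NS = 1e9  # Ollama reports durations in nanoseconds
print(f"prompt eval rate: {stats['prompt_eval_count'] / (stats['prompt_eval_duration'] / NS):.2f} tokens/s")
print(f"eval rate:        {stats['eval_count'] / (stats['eval_duration'] / NS):.2f} tokens/s")
```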


You mentioned local agents. I've noticed this too. How do ChatGPT and the others get around this and provide near-instant responses in long conversations?


They're not getting around it, just benefiting from the parallel compute and huge FLOPS of datacenter GPUs. Fundamentally, prefill compute is highly parallel, and HBM is simply that much faster than LPDDR. In practice, H100s and B100s can chew through prefill on a ~50k-token prompt in under a second, so the TTFT (time to first token) feels amazingly fast.
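
Back-of-envelope, treating prefill as compute-bound at roughly 2 × active-params FLOPs per token. All the hardware numbers below are rough assumptions for illustration, not measurements:

```
# Very rough, compute-bound prefill estimate: FLOPs ~= 2 * active_params * prompt_tokens,
# ignoring the attention term and all overhead. Hardware numbers are ballpark assumptions.
def prefill_seconds(prompt_tokens, active_params, sustained_flops):
    return 2 * active_params * prompt_tokens / sustained_flops

ACTIVE = 5.1e9           # ~active params per token for a 120B-class MoE (approximate)
PROMPT = 50_000          # prompt length in tokens

H100_FLOPS  = 500e12     # assumed sustained throughput on an H100-class GPU
M4MAX_FLOPS = 20e12      # assumed sustained throughput on an M4 Max GPU

print(f"H100-class: ~{prefill_seconds(PROMPT, ACTIVE, H100_FLOPS):.1f} s")
print(f"M4 Max:     ~{prefill_seconds(PROMPT, ACTIVE, M4MAX_FLOPS):.1f} s")
```

The gap is roughly the ratio of sustained matmul throughput, which is why the same long prompt feels instant hosted and sluggish locally.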


They cache the intermediate data (KV cache).
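
For intuition, here's a toy single-head sketch in plain NumPy (not any particular framework's API) of what the cache buys: each new token only projects and appends its own K/V row, instead of recomputing state for the whole prefix.

```
import numpy as np

d = 64
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

K_cache = np.empty((0, d))
V_cache = np.empty((0, d))

def attend(q, K, V):
    # single-query softmax attention
    scores = q @ K.T / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

def decode_step(x):
    """x: embedding of the newest token only; earlier tokens live in the cache."""
    global K_cache, V_cache
    K_cache = np.vstack([K_cache, x @ Wk])   # constant new work per token
    V_cache = np.vstack([V_cache, x @ Wv])
    return attend(x @ Wq, K_cache, V_cache)

# Real prefill computes all prompt K/V in one batched pass; here we just
# feed tokens one at a time to keep the sketch short.
for tok in rng.standard_normal((5, d)):
    out = decode_step(tok)
print(out.shape)  # (64,)
```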


It's odd that the result of this processing can't be cached.


It can be, and it is, by most good inference frameworks.
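
At the application layer it can be as simple as keying cached prefill state by a hash of the prompt prefix, so a long static system prompt only pays the prefill cost once. Sketch only: `run_prefill` and `decode_with_state` are hypothetical stand-ins for whatever your framework exposes (e.g. llama.cpp's prompt cache or vLLM's prefix caching).

```
import hashlib

_prefix_cache = {}  # sha256(prefix) -> opaque KV/prefill state

def cached_prefill(prefix, run_prefill):
    key = hashlib.sha256(prefix.encode()).hexdigest()
    if key not in _prefix_cache:
        _prefix_cache[key] = run_prefill(prefix)   # expensive: full prefill, once
    return _prefix_cache[key]                      # cheap: reused on every later call

def answer(system_prompt, user_msg, run_prefill, decode_with_state):
    state = cached_prefill(system_prompt, run_prefill)
    return decode_with_state(state, user_msg)      # only the new turn is prefilled
```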


The active parameter count is low, so it should be fast.
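
Fast at decode, anyway. A rough roofline (all numbers below are approximate assumptions, not measurements) shows why the low active-parameter count is what matters for tokens/s:

```
# Decode is mostly memory-bandwidth bound: each generated token streams the
# active expert weights, so an upper bound is bandwidth / (active_params * bytes/param).
ACTIVE_PARAMS   = 5.1e9   # ~active params per token for the 120B MoE (approximate)
BYTES_PER_PARAM = 0.5     # ~4-bit (MXFP4) weights
BANDWIDTH       = 546e9   # M4 Max unified-memory bandwidth, bytes/s (approximate)

ceiling = BANDWIDTH / (ACTIVE_PARAMS * BYTES_PER_PARAM)
print(f"theoretical decode ceiling: ~{ceiling:.0f} tokens/s")  # real runs land well below this
```

The ~39 tokens/s measured above is well under that ceiling, but the point stands: it's the few billion active parameters, not the full 120B, that set the per-token memory budget.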



