
When people talk about running a (quantized) medium-sized model on a Mac Mini, what kind of latency and throughput are they talking about? Do they mean something like 5 tokens per second, or an actually usable speed?


Here's a 4-bit 70B-parameter model, https://www.youtube.com/watch?v=5ktS0aG3SMc (deepseek-r1:70b Q4_K_M), on an M4 Max with 128 GB. Usable, but not very performant.


On an M1 MacBook Air with 8 GB, I got this running Gemma 3n:

12.63 tok/sec • 860 tokens • 1.52s to first token

I'm amazed it works at all with such limited RAM


I have started a crowdfunding campaign to get you a MacBook Air with 16 GB. You poor thing.


Y not meeee?

After considering my sarcasm for the last 5 minutes, I am doubling down. The government of the United States of America should enhance its higher IQ people by donating AI hardware to them immediately.

This is critical for global competitive economic power.

Send me my hardware, US government.


higher IQ people <-- well, you have to prove that first, so let me ask you a test question to prove it: how can you mix collaboration and competition in society to produce the optimal productivity/conflict ratio?


Up the ante with an M4 chip


Not meaningfully different; the M1 is virtually as fast as the M4.


https://github.com/devMEremenko/XcodeBenchmark: the M4 is almost twice as fast as the M1.


In this table, M4 is also twice as fast as M4.


You're comparing across vanilla/Pro/Max tiers. Within the equivalent tier, the M4 is almost 2x faster than the M1.


Twice the cost too.


?


Here's a quick recording of the 20b model on my 128 GB M4 Max MBP: https://asciinema.org/a/AiLDq7qPvgdAR1JuQhvZScMNr

and the 120b: https://asciinema.org/a/B0q8tBl7IcgUorZsphQbbZsMM

I am, um, floored


Generation is usually fast, but prompt processing is the main limitation with local agents. I also have a 128 GB M4 Max. How is the prompt processing on long prompts? Processing the system prompt for Goose always takes quite a while for me. I haven't been able to download the 120B yet, but I'm looking to switch to either that or GLM-4.5-Air as my main driver.


Here's a sample of running the 120b model on Ollama with my MBP:

```
total duration:       1m14.16469975s
load duration:        56.678959ms
prompt eval count:    3921 token(s)
prompt eval duration: 10.791402416s
prompt eval rate:     363.34 tokens/s
eval count:           2479 token(s)
eval duration:        1m3.284597459s
eval rate:            39.17 tokens/s
```
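
(For reference, a rough sketch of how you might pull the same timing fields from Ollama's local REST API yourself; the model tag and prompt are placeholders, it assumes Ollama is running on its default port, and the duration fields are reported in nanoseconds.)

```python
# Sketch: fetch the timing stats Ollama reports, via its local REST API.
# Assumes a local Ollama server on the default port; the model tag and
# prompt below are placeholders.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "gpt-oss:120b", "prompt": "Hello", "stream": False},
).json()

prompt_s = resp["prompt_eval_duration"] / 1e9  # nanoseconds -> seconds
eval_s = resp["eval_duration"] / 1e9
print(f"prompt eval rate: {resp['prompt_eval_count'] / prompt_s:.2f} tokens/s")
print(f"eval rate:        {resp['eval_count'] / eval_s:.2f} tokens/s")
```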


You mentioned "with local agents". I've noticed this too. How do ChatGPT and the others get around this and provide instant responses in long conversations?


Not getting around it, just benefiting from the parallel compute / huge FLOPS of GPUs. Fundamentally, prefill compute is itself highly parallel, and HBM is just that much faster than LPDDR. Effectively, H100s and B100s can chew through the prefill in under a second at ~50k-token lengths, so the TTFT (time to first token) can feel amazingly fast.
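
A rough back-of-the-envelope sketch of why that matters: TTFT is roughly prompt length divided by prefill throughput. The throughput numbers below are assumed, order-of-magnitude figures (the laptop one loosely mirrors the ~363 tok/s prompt eval rate quoted above), not benchmarks:

```python
# Rough TTFT estimate: time to first token is dominated by prefill,
# i.e. processing the whole prompt before any output appears.
# Throughput figures are illustrative assumptions, not measurements.

def prefill_seconds(prompt_tokens: int, prefill_tokens_per_sec: float) -> float:
    """Approximate prefill time for a given prompt length."""
    return prompt_tokens / prefill_tokens_per_sec

prompt_tokens = 50_000  # a long agent-style context

scenarios = {
    "datacenter GPU (HBM, high FLOPS)": 100_000.0,  # tokens/s, assumed
    "Apple Silicon laptop (LPDDR)": 400.0,          # tokens/s, assumed
}

for name, tps in scenarios.items():
    print(f"{name}: ~{prefill_seconds(prompt_tokens, tps):.1f} s to first token")
```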


They cache the intermediate data (KV cache).


It's odd that the result of this processing cannot be cached.


It can be, and it is by most good inference frameworks.
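
For intuition, here is a purely conceptual sketch of prompt-prefix caching (not any particular framework's API; the "KV state" below is a placeholder for the real per-layer key/value tensors). If a new prompt shares a prefix with an earlier request, the cached state for that prefix is reused and only the new suffix needs prefill:

```python
# Conceptual sketch of prompt-prefix (KV) caching, not a real framework API.
from typing import Dict, List, Optional, Tuple

class PrefixKVCache:
    def __init__(self) -> None:
        # Maps a token prefix (as a tuple) to its cached "KV state".
        self._cache: Dict[Tuple[int, ...], object] = {}

    def lookup(self, tokens: List[int]) -> Tuple[int, Optional[object]]:
        """Return the longest cached prefix length and its state, if any."""
        for end in range(len(tokens), 0, -1):
            state = self._cache.get(tuple(tokens[:end]))
            if state is not None:
                return end, state
        return 0, None

    def store(self, tokens: List[int], state: object) -> None:
        self._cache[tuple(tokens)] = state

def prefill(cache: PrefixKVCache, tokens: List[int]) -> object:
    cached_len, _cached_state = cache.lookup(tokens)
    suffix = tokens[cached_len:]           # only these tokens need real compute
    state = ("kv-state-for", len(tokens))  # placeholder for real KV tensors
    print(f"reused {cached_len} cached tokens, prefilled {len(suffix)} new ones")
    cache.store(tokens, state)
    return state

cache = PrefixKVCache()
system_prompt = list(range(3000))                # e.g. a long agent system prompt
prefill(cache, system_prompt + [1, 2, 3])        # first turn: full prefill
prefill(cache, system_prompt + [1, 2, 3, 4, 5])  # next turn: mostly cached
```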


The active parameter count is low, so it should be fast.


GLM-4.5-Air produces tokens far faster than I can read on my MacBook. That's plenty fast for me, but YMMV.



