
When people talk about running a (quantized) medium-sized model on a Mac Mini, what kind of latency and throughput are they talking about? Do they mean something like 5 tokens per second, or an actually usable speed?


Here's a 4-bit 70B-parameter model, https://www.youtube.com/watch?v=5ktS0aG3SMc (deepseek-r1:70b Q4_K_M), on an M4 Max with 128 GB. Usable, but not very performant.


On an M1 MacBook Air with 8 GB, I got this running Gemma 3n:

12.63 tok/sec • 860 tokens • 1.52s to first token

I'm amazed it works at all with such limited RAM


I have started a crowdfunding campaign to get you a MacBook Air with 16 GB. You poor thing.


Y not meeee?

After considering my sarcasm for the last 5 minutes, I am doubling down. The government of the United States of America should enhance its higher IQ people by donating AI hardware to them immediately.

This is critical for global competitive economic power.

Send me my hardware, US government.


higher IQ people <-- well, you have to prove that first, so let me ask you a test question to prove it: how can you mix collaboration and competition in society to produce the optimal productivity/conflict ratio?


Up the ante with an M4 chip


Not meaningfully different; the M1 is virtually as fast as the M4.


https://github.com/devMEremenko/XcodeBenchmark: the M4 is almost twice as fast as the M1.


In this table, M4 is also twice as fast as M4.


You're comparing across vanilla/Pro/Max tiers. Within the equivalent tier, the M4 is almost 2x faster than the M1.


Twice the cost too.


?


Here's a quick recording of the 20b model on my 128 GB M4 Max MBP: https://asciinema.org/a/AiLDq7qPvgdAR1JuQhvZScMNr

and the 120b: https://asciinema.org/a/B0q8tBl7IcgUorZsphQbbZsMM

I am, um, floored


Generation is usually fast, but prompt processing is the main limitation with local agents. I also have a 128 GB M4 Max. How is the prompt processing on long prompts? Processing the system prompt for Goose always takes quite a while for me. I haven't been able to download the 120B yet, but I'm looking to switch to either that or GLM-4.5-Air as my main driver.


Here's a sample of running the 120b model on Ollama with my MBP:

```
total duration:       1m14.16469975s
load duration:        56.678959ms
prompt eval count:    3921 token(s)
prompt eval duration: 10.791402416s
prompt eval rate:     363.34 tokens/s
eval count:           2479 token(s)
eval duration:        1m3.284597459s
eval rate:            39.17 tokens/s
```
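
(For reference, a rough sketch of how you might pull the same timing fields from Ollama's local REST API yourself; the model tag and prompt are placeholders, it assumes Ollama is running on its default port, and the duration fields are reported in nanoseconds.)

```python
# Sketch: fetch the timing stats Ollama reports, via its local REST API.
# Assumes a local Ollama server on the default port; the model tag and
# prompt below are placeholders.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "gpt-oss:120b", "prompt": "Hello", "stream": False},
).json()

prompt_s = resp["prompt_eval_duration"] / 1e9  # nanoseconds -> seconds
eval_s = resp["eval_duration"] / 1e9
print(f"prompt eval rate: {resp['prompt_eval_count'] / prompt_s:.2f} tokens/s")
print(f"eval rate:        {resp['eval_count'] / eval_s:.2f} tokens/s")
```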


You mentioned "with local agents". I've noticed this too. How do ChatGPT and the others get around this and provide instant responses in long conversations?


Not getting around it, just benefiting from the parallel compute / huge FLOPS of GPUs. Fundamentally, prefill compute is itself highly parallel, and HBM is just that much faster than LPDDR. Effectively, H100s and B100s can chew through the prefill in under a second at ~50k-token lengths, so the TTFT (time to first token) can feel amazingly fast.
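
A rough back-of-the-envelope sketch of why that matters: TTFT is roughly prompt length divided by prefill throughput. The throughput numbers below are assumed, order-of-magnitude figures (the laptop one loosely mirrors the ~363 tok/s prompt eval rate quoted above), not benchmarks:

```python
# Rough TTFT estimate: time to first token is dominated by prefill,
# i.e. processing the whole prompt before any output appears.
# Throughput figures are illustrative assumptions, not measurements.

def prefill_seconds(prompt_tokens: int, prefill_tokens_per_sec: float) -> float:
    """Approximate prefill time for a given prompt length."""
    return prompt_tokens / prefill_tokens_per_sec

prompt_tokens = 50_000  # a long agent-style context

scenarios = {
    "datacenter GPU (HBM, high FLOPS)": 100_000.0,  # tokens/s, assumed
    "Apple Silicon laptop (LPDDR)": 400.0,          # tokens/s, assumed
}

for name, tps in scenarios.items():
    print(f"{name}: ~{prefill_seconds(prompt_tokens, tps):.1f} s to first token")
```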


They cache the intermediate data (KV cache).


It's odd that the result of this processing cannot be cached.


It can be, and it is by most good inference frameworks.
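
For intuition, here is a purely conceptual sketch of prompt-prefix caching (not any particular framework's API; the "KV state" below is a placeholder for the real per-layer key/value tensors). If a new prompt shares a prefix with an earlier request, the cached state for that prefix is reused and only the new suffix needs prefill:

```python
# Conceptual sketch of prompt-prefix (KV) caching, not a real framework API.
from typing import Dict, List, Optional, Tuple

class PrefixKVCache:
    def __init__(self) -> None:
        # Maps a token prefix (as a tuple) to its cached "KV state".
        self._cache: Dict[Tuple[int, ...], object] = {}

    def lookup(self, tokens: List[int]) -> Tuple[int, Optional[object]]:
        """Return the longest cached prefix length and its state, if any."""
        for end in range(len(tokens), 0, -1):
            state = self._cache.get(tuple(tokens[:end]))
            if state is not None:
                return end, state
        return 0, None

    def store(self, tokens: List[int], state: object) -> None:
        self._cache[tuple(tokens)] = state

def prefill(cache: PrefixKVCache, tokens: List[int]) -> object:
    cached_len, _cached_state = cache.lookup(tokens)
    suffix = tokens[cached_len:]           # only these tokens need real compute
    state = ("kv-state-for", len(tokens))  # placeholder for real KV tensors
    print(f"reused {cached_len} cached tokens, prefilled {len(suffix)} new ones")
    cache.store(tokens, state)
    return state

cache = PrefixKVCache()
system_prompt = list(range(3000))                # e.g. a long agent system prompt
prefill(cache, system_prompt + [1, 2, 3])        # first turn: full prefill
prefill(cache, system_prompt + [1, 2, 3, 4, 5])  # next turn: mostly cached
```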


The active parameter count is low, so it should be fast.


GLM-4.5-Air produces tokens far faster than I can read on my MacBook. That's plenty fast for me, but YMMV.



