
I think the M1 chip finally proves the inherent design superiority of RISC over CISC. For years, Intel stayed ahead of all other competitors by having the best process, the highest clock speeds, and the most advanced out-of-order execution. By internally decoding CISC instructions into RISC-like micro-ops, Intel could feed a large number of execution ports and extract maximum ILP. They had to spend gobs of silicon for that: complex decoding (made worse by the legacy of x86's encodings), complex branch prediction, and deep out-of-order machinery all take a lot of real estate. They could afford that because they were ahead of everyone else in transistor count.

But in the end all of that went bye bye when Intel lost the process edge and therefore the transistor count advantage. Now, with a 5nm process, others can field gobs of transistors, and they don't have the x86 frontend millstone around their necks. So ARM64 unlocked a lot of frontend bandwidth to feed even more execution ports. And with the transistor budget so high, four huge cores (plus four efficiency cores) could be put on the die.

Now, people have argued for decades that the instruction density of CISC is a major advantage, because that density should make better use of I-cache capacity and fetch bandwidth. But it looks like decode bandwidth is what matters more. That, and RISC usually requires aligned fixed-width instructions, which means branch density can't get too high and branch-prediction data structures can be simpler and more effective. (Intel still has weird slowdowns if you have too many branches in a cache line.)
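As a back-of-the-envelope illustration of the "simpler prediction structures" point (my own toy sketch in C, not a description of any shipping predictor; the table size and example address are made up): with fixed 4-byte instructions a branch can only start at one of 16 slots in a 64-byte line, so a BTB-style table can drop the low two PC bits entirely, whereas a variable-length ISA has to cope with a branch starting at any byte offset.

    #include <stdint.h>
    #include <stdio.h>

    #define BTB_ENTRIES 4096   /* made-up table size */

    /* Fixed 4-byte instructions: the two low PC bits are always zero, so they
       carry no information and can be dropped before indexing the table. */
    static uint32_t btb_index_fixed(uint64_t pc)
    {
        return (uint32_t)((pc >> 2) % BTB_ENTRIES);
    }

    /* Variable-length instructions: a branch can start at any byte, so every PC
       bit has to participate, and more distinct addresses compete for the table. */
    static uint32_t btb_index_variable(uint64_t pc)
    {
        return (uint32_t)(pc % BTB_ENTRIES);
    }

    int main(void)
    {
        uint64_t pc = 0x100004a6c;   /* arbitrary, 4-byte-aligned example address */
        printf("fixed-width index:    %u\n", (unsigned)btb_index_fixed(pc));
        printf("variable-width index: %u\n", (unsigned)btb_index_variable(pc));
        return 0;
    }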

It seems frontend effects are real.



ARM64 can't be that easy to decode, since ARM's recent high-performance cores (A78, X1) decode ARM64 instructions into MOPs and feature a MOP cache: https://www.anandtech.com/show/15813/arm-cortex-a78-cortex-x.... And we don't know that the M1 doesn't do that. Also, even on Zen 2, the entire decode section is still a fraction of the size of, say, the vector units: https://forums.anandtech.com/threads/annotated-hi-res-core-d.... And the cores themselves take up a small amount of the die space on a modern CPU: https://cdn.mos.cms.futurecdn.net/m22pkncJXbqSMVisfrWcZ5-102....

I bet doing 8-wide x86 decoding would be tough, but once you've got a micro-op cache, it's doable so long as you get a cache hit. Zen 3 is effectively 8-wide the ~95% of the time you hit the micro-op cache.
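To show what "doable so long as you get a cache hit" means, here's a toy micro-op cache in C (purely illustrative, not Zen 3's actual structure; the capacity, the direct-mapped organization, and the legacy_decode stub are all made up): on a hit the frontend streams already-decoded ops at full width without touching the x86 length-finding logic; on a miss it drops back to the narrower legacy decoders and fills the line for next time.

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    #define UOP_CACHE_ENTRIES 512     /* made-up capacity, direct-mapped for simplicity */
    #define UOPS_PER_LINE 8           /* "8-wide" delivery on a hit */

    struct uop { uint32_t bits; };    /* stand-in for an already-decoded micro-op */

    struct uop_line {
        uint64_t tag;                 /* which 64-byte fetch block this line holds */
        int valid;
        size_t n;
        struct uop uops[UOPS_PER_LINE];
    };

    static struct uop_line uop_cache[UOP_CACHE_ENTRIES];

    /* Stub for the slow path: the legacy x86 decoders (classically ~4-wide). */
    static size_t legacy_decode(uint64_t fetch_addr, struct uop *out, size_t max)
    {
        size_t n = max < 4 ? max : 4;           /* pretend we decoded 4 uops */
        for (size_t i = 0; i < n; i++)
            out[i].bits = (uint32_t)(fetch_addr + i);
        return n;
    }

    /* Frontend fetch for one 64-byte block: wide on a hit, narrow + fill on a miss. */
    size_t frontend_fetch(uint64_t fetch_addr, struct uop *out)
    {
        uint64_t block = fetch_addr >> 6;
        struct uop_line *line = &uop_cache[block % UOP_CACHE_ENTRIES];

        if (line->valid && line->tag == block) {    /* hit: bypass the x86 decoders */
            memcpy(out, line->uops, line->n * sizeof *out);
            return line->n;                          /* up to 8 uops this cycle */
        }

        /* miss: run the legacy decoders, then fill the line for next time */
        line->n = legacy_decode(fetch_addr, line->uops, UOPS_PER_LINE);
        line->tag = block;
        line->valid = 1;
        memcpy(out, line->uops, line->n * sizeof *out);
        return line->n;
    }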

The real question is how does Apple keep that thing fed? An 8-wide decoder is pointless if most of the time you've got 6 empty pipelines: https://open.hpi.de/courses/parprog2014/items/aybclrPgY4nPyY... (discussing the ILP wall). The M1 outperforms Zen 3 by 20% on the SPEC GCC benchmark at roughly 1/3 lower clock speed; 1.2 / (2/3) = 1.8, so that's about 80% more work extracted per clock than Zen 3, which is itself a large advance in ILP.


> ARM64 can't be that easy to decode, since ARM's recent high-performance cores (A78, X1) decode ARM64 instructions into MOPs and feature a MOP cache

My point was more that fixed-width instructions allow trivial parallel decoding, while x86 requires predicting instruction lengths or brute-forcing all possible start offsets, which is costly.
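A minimal sketch in C of that difference (not modeled on any real decoder; insn_length is a fake placeholder for x86's length-finding logic): with a fixed width, every instruction's start address is known up front, so the loop body has no loop-carried dependency; with variable lengths, each start depends on the previous instruction's length.

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    /* Fixed 4-byte instructions (ARM64-style): the start of instruction i is just
       base + 4*i, so N decoders can each grab their instruction independently and
       work in parallel -- no decoder has to wait on another. */
    void decode_fixed_width(const uint8_t *code, uint32_t *out, size_t n)
    {
        for (size_t i = 0; i < n; i++)          /* every iteration is independent */
            memcpy(&out[i], code + 4 * i, 4);
    }

    /* Placeholder for x86 length finding (real lengths are 1..15 bytes and depend
       on prefixes, opcode, ModRM, SIB, etc. -- this stub just fakes it). */
    static size_t insn_length(const uint8_t *p)
    {
        return 1 + (p[0] % 15);
    }

    /* Variable-length instructions (x86-style): you can't know where instruction
       i+1 starts until you've at least length-decoded instruction i, so a naive
       decoder is a serial dependency chain. Real chips speculate lengths or
       examine every byte offset in parallel, which costs hardware. */
    void decode_variable_width(const uint8_t *code, const uint8_t **starts, size_t n)
    {
        size_t off = 0;
        for (size_t i = 0; i < n; i++) {
            starts[i] = code + off;
            off += insn_length(code + off);     /* must finish before the next start is known */
        }
    }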

> The real question is how does Apple keep that thing fed?

That's why there's such an enormous reorder buffer: it keeps a massive pool of potential work available for the execution ports to pick up. Of course, all of that is wasted when you have a branch mispredict. I haven't seen anything specific about the M1's branch prediction, but it is clearly top-notch.
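For anyone who hasn't seen one, here's a toy reorder buffer in C showing the general mechanism (a sketch only, not Apple's design; the ~630-entry size is AnandTech's estimate for Firestorm, so treat it as an assumption): allocation and retirement happen in program order, completion doesn't, and the deeper the buffer, the more independent work the execution ports can find past a stalled instruction or cache miss.

    #include <stdbool.h>
    #include <stddef.h>

    #define ROB_SIZE 630          /* reported Firestorm estimate; an assumption here */

    struct rob_entry {
        bool valid;               /* slot holds an in-flight instruction */
        bool completed;           /* executed, waiting to retire in program order */
    };

    struct rob {
        struct rob_entry e[ROB_SIZE];
        size_t head, tail, count; /* head = oldest in flight, tail = next free slot */
    };

    /* Allocate a slot in program order; when the ROB is full, the frontend stalls. */
    bool rob_alloc(struct rob *r, size_t *idx)
    {
        if (r->count == ROB_SIZE)
            return false;
        *idx = r->tail;
        r->e[*idx].valid = true;
        r->e[*idx].completed = false;
        r->tail = (r->tail + 1) % ROB_SIZE;
        r->count++;
        return true;
    }

    /* Execution units may complete entries in any order. */
    void rob_complete(struct rob *r, size_t idx)
    {
        r->e[idx].completed = true;
    }

    /* Retirement is strictly in order: stop at the oldest not-yet-completed entry.
       A branch mispredict throws away everything younger than the branch. */
    size_t rob_retire(struct rob *r)
    {
        size_t retired = 0;
        while (r->count > 0 && r->e[r->head].completed) {
            r->e[r->head].valid = false;
            r->head = (r->head + 1) % ROB_SIZE;
            r->count--;
            retired++;
        }
        return retired;
    }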


Other things helping are the forced >=16KB page sizes and the massive L1 caches (M1 has 4X Zen 3's L1 data cache and 6X the L1 instruction cache; how much of that cache size is enabled by the new process node and larger page sizes versus just the lack of x86 decode, I don't know).
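One hedged guess at the page-size connection (my speculation, not something Apple has confirmed): if the L1 is virtually indexed and physically tagged, keeping each way no larger than a page sidesteps aliasing, which caps L1 size at page size times associativity. Using the widely reported 8-way configurations, the arithmetic lines up suspiciously well:

    #include <stdio.h>

    /* A VIPT cache avoids virtual-address aliasing (without extra fixup hardware)
       when each way spans at most one page: max size = page_bytes * ways. */
    static long max_vipt_l1_bytes(long page_bytes, long ways)
    {
        return page_bytes * ways;
    }

    int main(void)
    {
        /* 4 KiB pages, 8 ways: 32 KiB ceiling -- matches Zen 3's 32 KiB L1D. */
        printf("4K pages,  8-way: %ld KiB\n", max_vipt_l1_bytes(4096, 8) / 1024);

        /* 16 KiB pages, 8 ways: 128 KiB ceiling -- matches the M1's reported 128 KiB L1D. */
        printf("16K pages, 8-way: %ld KiB\n", max_vipt_l1_bytes(16384, 8) / 1024);
        return 0;
    }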


L1 cache size is driven by the target clock frequency. Apple is not aiming for 5+GHz, whereas both Intel and AMD cores can turbo above 5GHz these days.
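Rough numbers to illustrate that tradeoff (the cycle counts below are reported/estimated figures, not vendor-confirmed, so treat them as assumptions): the wall-clock budget for an L1 access is cycles divided by clock frequency, and targeting a lower clock buys more absolute time to read a physically larger array.

    #include <stdio.h>

    /* Access-time budget in nanoseconds = latency in cycles / frequency in GHz. */
    static double l1_budget_ns(double cycles, double ghz)
    {
        return cycles / ghz;
    }

    int main(void)
    {
        /* Reported/approximate figures: M1 Firestorm ~3-cycle L1D at ~3.2 GHz,
           Zen 3 4-cycle L1D at ~4.9 GHz boost. */
        printf("M1    (3 cycles @ 3.2 GHz): %.2f ns to read a 128 KiB L1D\n",
               l1_budget_ns(3, 3.2));
        printf("Zen 3 (4 cycles @ 4.9 GHz): %.2f ns to read a  32 KiB L1D\n",
               l1_budget_ns(4, 4.9));
        return 0;
    }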



