The game is deeper than that. Your model is probably about right for the compiler you're using. It shouldn't be - compilers can do better - but it's all a work in progress.
Small scale stuff is that you don't usually spill around every call site. One of the calls is the special "return" branch, and the other N can probably share some of the register shuffling overhead if you're careful with allocation.
Bigger is that the calling convention is not a constant. Leaf functions can get special-cased, but so can non-leaf ones. Change the mapping of arguments to fixed registers / the stack, change which registers are callee/caller saved. The entry point for calls from outside the current module needs to match the platform ABI you claimed it'll follow, but nothing else does.
The inlining theme hints at this. Basic blocks _are_ functions that are likely to have a short list of known call sites, each of which can have the calling convention chosen by the backend, which is what the live in/out of blocks is about. It's not inlining that makes any difference to regalloc, it's being more willing to change the calling convention on each function once you've named it "basic block".
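A toy Python sketch of that live-in/out idea, with everything made up (register names, block layout, no real compiler's IR): the values live across a block edge are the "arguments", and the backend is free to pick a fresh register assignment per edge, with only exported entry points having to honour the platform ABI.

```python
# Toy sketch of the "basic blocks are functions" view above. Everything here
# is illustrative (made-up register names, no real compiler IR): the point is
# only that each block edge can get its own ad-hoc "calling convention" for
# the values live across it, while the platform ABI only constrains exported
# entry points.

blocks = {
    "entry": {"live_in": [],         "live_out": ["x", "y"]},
    "loop":  {"live_in": ["x", "y"], "live_out": ["x", "y", "acc"]},
    "exit":  {"live_in": ["acc"],    "live_out": []},
}

def edge_convention(live_values, regs=("r0", "r1", "r2", "r3")):
    """Pick a register for each value crossing a block edge.
    A real backend would choose assignments that minimise shuffling in both
    the predecessor and the successor; here they're handed out in order, and
    anything that doesn't fit would be spilled to the stack."""
    if len(live_values) > len(regs):
        raise NotImplementedError("spill the remainder to the stack")
    return dict(zip(live_values, regs))

# The entry->loop edge gets its own convention, chosen freely by the backend:
print(edge_convention(blocks["entry"]["live_out"]))  # {'x': 'r0', 'y': 'r1'}
print(edge_convention(blocks["loop"]["live_out"]))   # {'x': 'r0', 'y': 'r1', 'acc': 'r2'}
```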
Why is almost no one in this comment thread willing to face the scenario where the function call has to actually happen, and be an actual function call? The reactions are either "no-no-no-no, the call will be inlined, don't you worry your pretty head" or "well, then the compiler will just use fewer registers to make fewer spills" — which precisely agrees with my point that having more registers ain't necessarily all that useful.
> Small scale stuff is you don't usually spill around every call site.
Well duh: it's small, so even just 8 registers is likely enough for it. So again, why bother with cumbersome schemes to extend to 32 registers?
And this problem actually exists: that's why SPARC tried register windows, and even crazier schemes on the software side of things have been proposed, e.g. [0] — seriously, read this. And it's 30 years old, and IIUC nothing much came out of it, so excuse me if I'm somewhat skeptical about "compilers can do better - but it's all a work in progress" claims. Perhaps they already do the best they can for general-purpose CPUs. Good thing we have other kinds of processing units readily available nowadays.
Good post! Stuff I didn't know x64 has. Sadly it doesn't answer the "how many registers are behind rax" question I was hoping for; I'd love to know how many outstanding writes one can have to the various architectural registers before the renaming machinery runs out and things stall. Not really for immediate application to life, just a missing part of my mental cost model for x64.
If you’re asking about the register file, it’s around a couple hundred registers, varying by microarchitecture.
You’d need several usages of the ISA register without dependencies to run out of physical registers. You’re more likely to be bottlenecked by execution ports or the decoder way before that happens.
I've seen claims that it's different for different architectural registers, e.g. _lots_ of backing store for rax, less for rbx. It's likely to be significant for the vector registers too, which could plausibly have features like one backing store shared across the various widths, in which case deliberately using the smaller vectors would sometimes win out. I'll never bother to write the asm by hand with that degree of attention, but I would like better cost models in the compiler backend.
In the Intel-AMD CPUs, there are separate register files for renaming the 16 general-purpose registers (which will become 32 registers in Intel Nova Lake and Diamond Rapids, by the end of this year) and for renaming the 16 (AVX) or 32 (AVX-512) vector registers.
Both register files have a few hundred scalar and vector registers, respectively.
Besides these 2 big register files, there are a few other registers for renaming some special registers, e.g. the flags register and the AVX-512 mask registers.
Between the general-purpose registers there are no renaming differences: any of the 16 registers can be mapped to any of the hundreds of hidden registers, regardless of whether the register name used in the program is RAX, RCX or whatever.
Some differences between apparently similar instructions may be caused not by the fact that they use RAX or another register, but by whether they affect the flags or not, because the number of renaming registers available for flags is much smaller than the hundreds available for GPRs.
It'll go much faster if you give each process a warp instead of a thread. That means each process has its own IP and set of vector registers, and when your editor takes a different branch to your browser, no cost.
Merely misled by marketing. The x64 arch has 512bit registers and a hundred or so cores. The gpu arch has 1024bit registers and a few hundred SMs or CUs, an SM/CU being the thing equivalent to an x64 core.
The software stacks running on them are very different but the silicon has been converging for years.
Relatively successful was slotting the cat5 jacket, cutting off two or three of the pairs, twisting/tying the remaining pair to the old wire, then sliding the jacket back over the join before wrapping it in a conservative amount of electrical tape. You want the join to be a similar width to the cable and preferably flexible.
I have a suspicion that pulling fishing line first is the right play if you can manage to connect it to the old wire. Flexible, very high tensile strength, small.
In addition, in one room I ran two CAT5E cables as there was conduit along the entire way. So I took a CAT5E cable double the length of the conduit, stripped the outer sheath in the middle, folded the cable to get a loop and then attached the phone cable to that using the individual inner wires. Plus tape.
Pulling cables through walls is really easy for some construction styles and really difficult for others.
Can involve taking up floorboards and drilling horizontally through beams, plumber style. Or cutting slots in masonry with angle grinders. Sometimes there are existing wires you can tie to and pull through, sometimes the existing wires were stapled to the walls.
On the bright side everything about the ethernet wires and connections is trivial. Like demo to a friend in 20 minutes and let them walk off with the toolbox and they'll be fine wiring their house, if the construction style is amenable.
Agreed. I tugged on each phone wire to see if it was free. And I got lucky on all of them.
One of the problems I had was a kinked conduit where concrete was poured on top, or at least that is what I assumed. Was a bit difficult to get the “knot” (where the phone wire was connected to the CAT5E) through that spot.
The twisted pairs (there should be two, but one pair is broken...) installed in the 60s in my home are so stuck you will never, ever, get those out without ripping the wall apart. Originally the coaxials should have gone through the same pipes, as there should be enough space, but there is so much gunk in there it was impossible, so they laid a new tube through the floors and ceilings in the corner. For fun, and because institutional knowledge is for suckers, they tried the same with fiber and simply gave up, so now we are in limbo because the computer says we have fiber but we don't.
Lots of sympathy with this plight. Great to hear that someone has done the needful and rendered MoCA style modems over pairs of copper. I'm probably a customer for that.
I'm currently running MoCA over spliced coax as part of the local connection and not amused by the 5ms latency on it. Also running 100mbit over cat3 I found in a wall which does work, but cat3 in another wall can't hold 10mbit. That link actually can hold 70mbit of vdsl but after a nearby lightning strike slagged various hardware I've moved the vdsl modem back to the BT wires entry point and run the output through some fibre.
And there's a wifi bridge between two other points. And some ethernet running outside the building. Previously also ethernet-over-mains that I might bring back now that I've learned what spanning tree protocols are so the periodic reboots they inexplicably require can be tolerated transparently.
Also the connection to the internet itself is crap, so I'm bonding vdsl, starlink and 5g through the openmptcprouter project. Just lots of redundancy and self-healing hacks all over the place to give an observably solid connection.
Which is a rambling way to say that if you're in Britain and your network connection brings you sorrow, it can be forced to be acceptable with application of more time and money than other countries require.
I've had powerline adapters with uptimes measured in years (basically in between power cuts). I think yours might be defective. They absolutely do not require reboots.
I see a lot of references to `device_map="cuda:0"` but no cuda in the github repo. Is the complete stack flash attention plus this python plus the weights file, or does one need vLLM running as well?
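What I'd expect the transformers-only path to look like, if that's all it is — I'm guessing at the usual Hugging Face pattern here, and the model id below is a placeholder, not the repo's actual weights:

```python
# Minimal sketch, assuming the repo wraps Hugging Face `transformers`.
# The model id is a placeholder; no vLLM involved in this path.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "some-org/some-weights"  # hypothetical
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",                      # place all weights on GPU 0
    attn_implementation="flash_attention_2",  # needs the flash-attn package
)

inputs = tokenizer("Hello", return_tensors="pt").to("cuda:0")
out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```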
I think if you hit full path coverage in each of them independently, run all the cases through both, and check they're consistent, you're still done.
Or branch coverage for the lesser version; the idea is still to generate interesting cases based on each implementation, not based solely on one of them.
If the buggy implementation relies indirectly on the assumption that 2^n - 1 is composite, by performing a calculation that's only valid for composite values on a prime value, there won't be a separate path for the failing case. If the Mersenne numbers don't affect flow control in a special way in either implementation, there's no reason for the path coverage heuristic to produce a case that distinguishes the implementations.
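A contrived sketch of that, with both functions and the bug made up for illustration. Here the buggy assumption does sit on its own branch (unlike the truly branchless case described above), yet full branch coverage is still reachable via a composite Mersenne number like 15, so a coverage-driven generator is never forced to try the distinguishing input 31.

```python
# Contrived differential test: a reference primality check vs a "fast" one
# that bakes in the wrong assumption that every 2^k - 1 is composite.
import random

def is_prime_reference(n: int) -> bool:
    """Plain trial division; slow but obviously correct."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def is_prime_fast(n: int) -> bool:
    """Buggy: treats every number of the form 2^k - 1 as composite."""
    if n < 2:
        return False
    if (n & (n + 1)) == 0:   # true exactly when n == 2^k - 1
        return False         # wrong for Mersenne primes: 3, 7, 31, 127, ...
    for d in range(2, int(n ** 0.5) + 1):
        if n % d == 0:
            return False
    return True

# Running both over the same inputs and checking agreement is the consistency
# check discussed above. Full branch coverage of is_prime_fast is reachable
# with n = 15 (a composite 2^k - 1), so coverage alone never demands n = 31.
random.seed(0)
for _ in range(10_000):
    n = random.randrange(2, 10_000)
    if is_prime_fast(n) != is_prime_reference(n):
        print("implementations disagree at", n)  # only Mersenne primes trigger this
        break
```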