I think what you're describing is SPMD, which is a programming model / compilation strategy, not a hardware architecture. I'm not sure, but I think SIMT is SIMD with multiple program counters (one per group of N lanes) to enable some limited control-flow divergence between lane groups.
The PC is shared in traditional SIMT, and diverging branches are handled by masking: both sides of a branch execute in turn, with inactive lanes masked off. Nvidia introduced per-thread PCs with Volta. I think AMD still uses a shared PC across each wavefront?
Rust and C++ implement generics with monomorphization rather than boxing, so there is a potential performance hit associated with a type like this in Java that is guaranteed not to exist in Rust.
In practice, the JVM may still monomorphize it, but it is not guaranteed to, and this would be a good reason to avoid unnecessary uses of generics in a high performance codebase like a kernel, if you chose to write one in Java.
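A minimal sketch of the erasure issue (names are mine; the JIT may well unbox and inline the generic version in practice, but nothing guarantees it):

```java
public class Erasure {
    // Erased generic: T becomes Object at runtime, so int arguments are
    // boxed to Integer on the way in and the comparison goes through a
    // virtual interface call. Rust would compile a dedicated max::<i32>
    // with no boxing at all.
    static <T extends Comparable<T>> T genericMax(T a, T b) {
        return a.compareTo(b) >= 0 ? a : b;
    }

    // Hand-specialized primitive version: no allocation, no virtual call.
    static int intMax(int a, int b) {
        return a >= b ? a : b;
    }

    public static void main(String[] args) {
        System.out.println(genericMax(40, 42)); // autoboxes both ints
        System.out.println(intMax(40, 42));
    }
}
```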
Sure, I guess that's worth mentioning in the context of the original post. It's true that Java's implementation of parametric polymorphism has a performance drawback compared to Rust or C++. And it's certainly a bad idea to use Java generics in hot code paths without considering that drawback.
But GP described something they wanted from a type system and basically said a container with `Functor`-like behavior is not possible in Java. It's possible, albeit with a performance drawback, and a bit clunkier to work with than in Rust, Haskell, or a language with native HKT support.
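For example, here's a hypothetical sketch of the clunky-but-possible version (interface and class names are mine). Without higher-kinded types, the interface can only promise the erased `Functor<R>`; each implementation narrows the return type via covariance to recover its concrete container:

```java
import java.util.function.Function;

// A Functor-like interface. Code written against Functor<T> loses the
// concrete container type, which is exactly the clunkiness mentioned.
interface Functor<T> {
    <R> Functor<R> map(Function<? super T, ? extends R> f);
}

final class Box<T> implements Functor<T> {
    final T value;
    Box(T value) { this.value = value; }

    // Covariant return type: Box<T>.map gives back a Box<R>, not just
    // the bare Functor<R> the interface promises.
    @Override
    public <R> Box<R> map(Function<? super T, ? extends R> f) {
        return new Box<>(f.apply(value));
    }
}
```

Usage: `new Box<>(21).map(x -> x * 2).value` evaluates to 42, but only because we kept the static type `Box`; go through the `Functor` interface and the concrete type is gone.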
The parent commenter states that Java is their everyday language, so in this context I don't think we're talking about performance, nor about the needs of a kernel.
Not just when the codomain is a field, but more generally when the codomain is itself a vector space. The former is a special case of the latter where you construct a 1D vector space from a field.
> In a clocked design the clock signal needs to be routed to every element on the chip which requires a lot of power, the more so the higher the frequency is.
Clock only needs to be distributed to sequential components like flip-flops or SRAMs. The number of clock-distribution wire-millimeters in a typical chip is dwarfed by the number of data wire-millimeters, and if a neural network is well trained and quantized, its activations should be close to random, so the number of transitions per clock cycle should be about 0.5 (as opposed to 1 for clock wires), meaning that power can't be dominated by the clock. The flip-flops that prevent clock skew are a small percentage of area, so I don't think those can tip the scales either. On the other hand, in asynchronous digital logic you need valid-bit calculation on every single piece of logic, which seems like a pretty huge overhead to me.
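The standard dynamic-power model behind this back-of-envelope argument, where $\alpha$ is the switching activity factor:

$$P_{\mathrm{dyn}} = \alpha\, C\, V^{2} f$$

Clock nets toggle every cycle ($\alpha = 1$ in the convention above), while well-randomized data nets switch roughly half the time ($\alpha \approx 0.5$), so total power comes down to which nets dominate the switched capacitance $C$, i.e. the wire-millimeters.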
There's obvious potential savings in not wasting FLOPs recalculating things unnecessarily, but I'm not sure how much of that could be realized by just building a data-flow digital GPU. The only attempt at a data-flow digital processor I'm aware of was AMULET (by ARM designer Steve Furber), which was not very successful.
There's more promise in analog chip designs, such as here:
Or otherwise smarter architectures (software only or S/W+H/W) that design out the unnecessary calculations.
It's interesting to note how extraordinarily wasteful transformer-based LLMs are too. The transformer was designed partly inspired by linguistics and partly shaped by the parallel hardware (GPUs etc.) available to run it on. Language mostly has only local sentence-structure dependencies, yet the transformer's self-attention mechanism has every word in a sentence paying attention to every other word (to some learned degree)! Turns out it's better to be dumb and fast than smart, although I expect future architectures will be much more efficient.
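Concretely, for a sequence of $n$ tokens, standard self-attention forms an $n \times n$ score matrix:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$

so compute and memory for the scores grow as $O(n^2 d_k)$, even when most of the dependencies that matter are local.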
The open source release of XLA predates Lattner's tenure at Google by 7 months, and it definitely existed before that -- the codebase was already 66k SLOC at that point. During his tenure it went from 100k SLOC to 250k SLOC. It's now 700k SLOC. He also has, as far as I can tell, zero commits in the XLA codebase. "Of LLVM fame" would be more accurate I think.
my bad - i guess i was just saying he led that team but didn't mean to imply he originated it
you seem to have very precise knowledge of the SLOC at a point in time - just curious, is there any tooling you used to do that? that can be pretty nifty to pull out on occasion
I git cloned the repo and then ran sloccount after checking out various commits (just did `git log | grep -C3 'Jan 1 [0-9:]* 2017'` or similar to find the relevant commits)
My understanding is that we don't know for sure what kind of virus caused Spanish flu (it occurred before our ability to analyse this kind of thing). What are you basing your assertion that it was an orthomyxovirus on?
Scientists have been able to retrieve genetic material from the remains of those confirmed to have died during that pandemic. They've effectively sequenced its genome, enough to know how it relates to other flu viruses: https://www.news.vcu.edu/article/Genetic_sequencing_of_deadl...
You might be thinking of the 1889-90 "flu" pandemic which has been theorized to be from a coronavirus known now as OC43, but it's not certain.
The heat emitted by burning fossil fuels is completely irrelevant compared to the greenhouse impact. We burn about 11.7 gigatons of oil equivalent a year (roughly 1.4 * 10^14 kWh/yr, or 2.7 * 10^16 kWh over 200 years). The increase in the greenhouse effect over just the last 20 years adds over 2.2 * 10^15 kWh/yr to the planet, about 16x the annual heat from combustion.
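A quick sanity check of the arithmetic, using $1\ \mathrm{toe} \approx 11{,}630\ \mathrm{kWh}$:

$$11.7 \times 10^{9}\ \mathrm{toe/yr} \times 1.163 \times 10^{4}\ \mathrm{kWh/toe} \approx 1.36 \times 10^{14}\ \mathrm{kWh/yr}$$

$$\frac{2.2 \times 10^{15}\ \mathrm{kWh/yr}}{1.36 \times 10^{14}\ \mathrm{kWh/yr}} \approx 16$$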
I feel like something important that's been overlooked here is how easy it is to seat the driver in the screw recess. This places a very real limit on how fast a large number of screws can be driven. Torx has fairly tight tolerances and can be annoying to engage; I've never used Robertson, but it seems easier to get in.