Back when I was young, which wasn't that long ago, one tried to put as much pre-computed stuff into memory as possible, since memory was so much faster than the CPU. Lookup tables left, right and center.
These days you can do thousands of calculations in the time it takes to fetch a few bytes from memory. And not only is the speed gap getting worse, but memory sizes aren't keeping up either.
Guess we're not far from the point where compressing stuff before putting it in memory is something you'd want to do most of the time. LZ4 decompression[1] is already within a small factor of memcpy speed.
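To put rough numbers on that, here's a minimal sketch, assuming liblz4 is installed and linked with -llz4; the 64 MiB buffer and its synthetic contents are just placeholders. It times a plain memcpy against LZ4_decompress_safe on the same data:

```c
#define _POSIX_C_SOURCE 199309L
#include <lz4.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    const int n = 64 * 1024 * 1024;               /* 64 MiB of mildly compressible data */
    char *src = malloc(n), *dst = malloc(n);
    char *comp = malloc(LZ4_compressBound(n));
    for (int i = 0; i < n; i++)
        src[i] = (i % 251 < 200) ? 'A' : (char)i; /* arbitrary, somewhat repetitive filler */

    int csize = LZ4_compress_default(src, comp, n, LZ4_compressBound(n));

    double t0 = now_sec();
    memcpy(dst, src, n);
    double t1 = now_sec();
    LZ4_decompress_safe(comp, dst, csize, n);
    double t2 = now_sec();

    printf("memcpy: %.1f ms, LZ4 decompress: %.1f ms, ratio: %.2fx\n",
           (t1 - t0) * 1e3, (t2 - t1) * 1e3, (double)n / csize);
    free(src); free(dst); free(comp);
    return 0;
}
```

The exact gap depends heavily on how compressible the data is, but on typical hardware the decompression loop stays within a small factor of the raw copy.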
The RAM was so terrible that you essentially tried to keep the processors running in cache for as long as possible; RAM access was painful.
There is a performance profiling tool built into F3DEX3 which shows that, while running Zelda OoT, the system is idle roughly 70% of the time, just waiting for memory transfers. The folks at SGI/RAMBUS cut corners a little too hard building that system.
But it turns out this kind of performance profile was just prep for where we are heading, apparently.
I was reminded of this again when watching a performance analysis video that may or may not have been posted here (sometimes I get things from here or reddit, but sometimes the real story is in the related videos). It doesn't take a very big lookup table before it's faster to just rerun the calculations.
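To make that concrete, here's a toy sketch; the table size, the hash-style "calculation", and the iteration count are all made up for illustration. It compares random hits into a 64 MiB lookup table against simply recomputing the value:

```c
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ENTRIES (1u << 24)            /* 16M entries x 4 bytes = 64 MiB: well past L3 cache */
#define ITERS   (10 * 1000 * 1000)

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

/* the "expensive" function being cached: really just a handful of ALU ops */
static inline uint32_t compute(uint32_t x) {
    x ^= x >> 17; x *= 0xed5ad4bbu;
    x ^= x >> 11; x *= 0xac4c1b51u;
    return x;
}

int main(void) {
    uint32_t *table = malloc(ENTRIES * sizeof *table);
    for (uint32_t i = 0; i < ENTRIES; i++) table[i] = compute(i);

    uint32_t idx, sum_lut = 0, sum_alu = 0;

    double t0 = now_sec();
    idx = 12345;
    for (int i = 0; i < ITERS; i++) {             /* random indices defeat the prefetcher */
        idx = idx * 1664525u + 1013904223u;
        sum_lut += table[idx % ENTRIES];          /* one likely-cache-missing load */
    }
    double t1 = now_sec();
    idx = 12345;
    for (int i = 0; i < ITERS; i++) {
        idx = idx * 1664525u + 1013904223u;
        sum_alu += compute(idx % ENTRIES);        /* recompute instead: no memory traffic */
    }
    double t2 = now_sec();

    printf("lookup: %.0f ms, recompute: %.0f ms (checksums %u %u)\n",
           (t1 - t0) * 1e3, (t2 - t1) * 1e3, sum_lut, sum_alu);
    free(table);
    return 0;
}
```

Once the table no longer fits in cache, each lookup pays a trip to DRAM while the recomputation stays entirely on-core.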
Especially when you throw multiprocessing into the mix. We need better benchmarking tools that load up competing workloads in the background, so you can tell how your optimization really behaves in production instead of in the little toy universe of your benchmark.
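Something like that is easy to prototype: a minimal harness sketch, assuming POSIX threads (compile with -pthread), where the thread count and buffer size are arbitrary. It spawns memory-bandwidth hogs in the background while the code under test runs:

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define HOG_BYTES (256u * 1024 * 1024)   /* each hog streams through 256 MiB */

static atomic_int stop;

/* "noisy neighbour": keeps the memory bus busy until told to stop */
static void *bandwidth_hog(void *arg) {
    char *buf = arg;
    while (!atomic_load(&stop))
        memset(buf, 0x5a, HOG_BYTES);
    return NULL;
}

int main(void) {
    enum { NHOGS = 3 };                  /* arbitrary: leave a core or two for the benchmark */
    pthread_t tid[NHOGS];
    char *bufs[NHOGS];

    for (int i = 0; i < NHOGS; i++) {
        bufs[i] = malloc(HOG_BYTES);
        pthread_create(&tid[i], NULL, bandwidth_hog, bufs[i]);
    }

    /* --- run the code under test here, timed as usual --- */
    printf("benchmark body goes here, now competing for memory bandwidth\n");

    atomic_store(&stop, 1);
    for (int i = 0; i < NHOGS; i++) {
        pthread_join(tid[i], NULL);
        free(bufs[i]);
    }
    return 0;
}
```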
On the latter point, macOS has had compressed memory for a long time now, and some Linux distributions also enable it out of the box (I don't know anything about Windows).
One of the time series databases streams compressed blocks through memory when doing searches, with each core handling its own distinct blocks. For some scenarios it's faster to do a table scan than to keep extra indexes hot.
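The pattern is roughly this (a toy sketch assuming liblz4 and POSIX threads; the block layout and the predicate are invented and not any particular database's format): each worker decompresses only its own blocks into a small scratch buffer and scans them, so only the compressed data stays resident.

```c
#include <lz4.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

enum { NBLOCKS = 8, NTHREADS = 4, RAW = 1 << 16 };   /* toy sizes */

static char *cdata[NBLOCKS];     /* compressed blocks: the only copy kept in memory */
static int   csize[NBLOCKS];

struct worker { int first, count; long hits; };

static void *scan_worker(void *arg) {
    struct worker *w = arg;
    int32_t *scratch = malloc(RAW);                  /* per-thread decompression buffer */
    for (int b = w->first; b < w->first + w->count; b++) {
        LZ4_decompress_safe(cdata[b], (char *)scratch, csize[b], RAW);
        for (size_t i = 0; i < RAW / sizeof *scratch; i++)
            if (scratch[i] > 900)                    /* placeholder predicate */
                w->hits++;
    }
    free(scratch);
    return NULL;
}

int main(void) {
    /* build some compressible toy blocks */
    int32_t *raw = malloc(RAW);
    for (int b = 0; b < NBLOCKS; b++) {
        for (size_t i = 0; i < RAW / sizeof *raw; i++) raw[i] = (int32_t)(i % 1000);
        cdata[b] = malloc(LZ4_compressBound(RAW));
        csize[b] = LZ4_compress_default((char *)raw, cdata[b], RAW, LZ4_compressBound(RAW));
    }
    free(raw);

    pthread_t tid[NTHREADS];
    struct worker w[NTHREADS];
    long total = 0;
    for (int t = 0; t < NTHREADS; t++) {
        w[t] = (struct worker){ .first = t * (NBLOCKS / NTHREADS),
                                .count = NBLOCKS / NTHREADS, .hits = 0 };
        pthread_create(&tid[t], NULL, scan_worker, &w[t]);
    }
    for (int t = 0; t < NTHREADS; t++) { pthread_join(tid[t], NULL); total += w[t].hits; }
    printf("matching rows: %ld\n", total);
    return 0;
}
```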
Even in the 90s I recall decompressing ZIP files on a 486 being limited by HDD speed. It felt like, even then, we were headed towards compressed memory systems once data could be compressed quickly enough.
I think quantization for large language models already does something like this: the parameters are compressed in memory and decompressed when performing the forward pass.
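Roughly what that looks like, as a toy sketch using simple per-row int8 quantization with a float scale (real LLM schemes such as 4-bit group quantization are more elaborate): the weights sit in memory at one byte each and are expanded to float only inside the matrix-vector product.

```c
#include <math.h>
#include <stdint.h>
#include <stdio.h>

/* y = W x, where W is stored as int8 with one float scale per row */
static void matvec_q8(const int8_t *wq, const float *row_scale,
                      const float *x, float *y, int rows, int cols) {
    for (int r = 0; r < rows; r++) {
        float acc = 0.0f;
        for (int c = 0; c < cols; c++)
            acc += (float)wq[r * cols + c] * x[c];   /* "decompress" on the fly */
        y[r] = acc * row_scale[r];                   /* apply the scale once per row */
    }
}

int main(void) {
    enum { ROWS = 4, COLS = 8 };
    float w[ROWS][COLS], x[COLS], y[ROWS], scale[ROWS];
    int8_t wq[ROWS][COLS];

    /* toy weights, then quantize each row: scale = max|w| / 127 */
    for (int r = 0; r < ROWS; r++) {
        float m = 0.0f;
        for (int c = 0; c < COLS; c++) {
            w[r][c] = sinf((float)(r * COLS + c));
            if (fabsf(w[r][c]) > m) m = fabsf(w[r][c]);
        }
        scale[r] = m / 127.0f;
        for (int c = 0; c < COLS; c++)
            wq[r][c] = (int8_t)lrintf(w[r][c] / scale[r]);
    }
    for (int c = 0; c < COLS; c++) x[c] = 1.0f;

    matvec_q8(&wq[0][0], scale, x, y, ROWS, COLS);
    for (int r = 0; r < ROWS; r++) printf("y[%d] = %f\n", r, y[r]);
    return 0;
}
```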
[1]: https://github.com/lz4/lz4