> but I can't think of a single one for Apple Silicon.
The post here is exactly one for Apple Silicon. It compares a naive PyTorch implementation, which may not even keep a 4090 busy (for smaller, not especially compute-intensive models, having the entire computation driven by Python is limiting, which is partly why torch.compile gives such large improvements), against an implementation purpose-optimized for Apple Silicon (tuned for both CPU and GPU efficiency).
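The "computation driven by Python" overhead is easy to demonstrate outside of PyTorch too. A minimal sketch using NumPy as a stand-in (the same interpreter-dispatch effect is why eager PyTorch on small models can fail to keep a GPU busy):

```python
import time
import numpy as np

x = np.random.rand(1_000_000)

# One vectorized call: the loop runs in compiled code.
t0 = time.perf_counter()
y_fast = np.sqrt(x) * 2.0 + 1.0
t_fast = time.perf_counter() - t0

# Same math dispatched element-by-element from Python: interpreter
# overhead dominates. Only 1% of the data is processed, then the time
# is scaled up, so the comparison stays quick to run.
t0 = time.perf_counter()
y_slow = np.array([np.sqrt(v) * 2.0 + 1.0 for v in x[:10_000]])
t_slow_scaled = (time.perf_counter() - t0) * 100

print(f"vectorized: {t_fast:.4f}s, Python-driven (extrapolated): {t_slow_scaled:.2f}s")
```

The gap is typically one to two orders of magnitude, which is roughly the headroom torch.compile (or a hand-optimized native implementation) recovers when per-op Python dispatch is the bottleneck.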