
> but I can't think of a single one for Apple Silicon.

The post here is exactly one for Apple Silicon. It compared a naive PyTorch implementation, which may not even keep a 4090 busy (for smaller, less compute-intensive models, having the entire computation driven by Python is limiting, which is partly why torch.compile gives such large improvements), to an implementation purpose-built for Apple Silicon and optimized for both CPU and GPU efficiency.
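To make the "driven by Python" point concrete, here is a rough stdlib-only sketch. It is an analogy, not PyTorch: the function names are made up for illustration, and it says nothing about real 4090 or Apple Silicon numbers. It only shows that the same arithmetic costs more when the interpreter dispatches every tiny step than when one call hands the whole loop to compiled code.

```python
import operator
import timeit


def many_small_steps(xs):
    # One interpreter-dispatched operation per element, loosely analogous
    # to a Python loop issuing many tiny ops to an accelerator.
    total = 0
    for x in xs:
        total += x * x
    return total


def one_big_call(xs):
    # Same arithmetic, but the loop itself runs inside C code
    # (sum and map), so far less time is spent in the interpreter.
    return sum(map(operator.mul, xs, xs))


def compare(n=100_000, repeats=20):
    # Returns (seconds_for_loop, seconds_for_single_call); the absolute
    # values depend entirely on the machine and Python version.
    xs = list(range(n))
    slow = timeit.timeit(lambda: many_small_steps(xs), number=repeats)
    fast = timeit.timeit(lambda: one_big_call(xs), number=repeats)
    return slow, fast


if __name__ == "__main__":
    slow, fast = compare()
    print(f"per-element loop: {slow:.3f}s, single call: {fast:.3f}s")
```

Both functions return the same result; only where the loop runs differs. The smaller the per-step work, the more the dispatch overhead dominates, which is the situation described above for small models.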



The PyTorch performance is awful though. You'd have to be kinda crazy not to use an optimized implementation.



