Great work.
It's a real breath of fresh air coming from huge frameworks that use CUDA.
There are so many different CUDA versions, with each framework using its own, all relying on a different driver; everything needs a new version every 3 months and takes ~10G (and don't even get me started on cuDNN needing a manual, logged-in install).
Here everything is just two files. For embedded systems that don't have a GPU it's perfect.
Here the parallelization and vectorization have been done by hand, but there is a glimmer of hope coming from various compiler projects:
Here is an interesting Intel project, definitely worth a look, that does the parallelization and vectorization automatically for different architectures:
https://ispc.github.io/ispc.html
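To make "done by hand" concrete, here is a minimal sketch in plain C (not taken from the project; the function names `dot_plain` and `dot_unrolled` are made up for illustration). The first version leaves everything to the compiler, the second is the kind of manual unrolling with independent accumulators that tools like ISPC aim to make unnecessary.

```c
#include <assert.h>
#include <stddef.h>

/* Plain loop: vectorization is left entirely to the compiler. For float
   reductions this typically requires relaxed floating-point flags,
   because vectorizing reorders the additions. */
float dot_plain(const float *a, const float *b, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}

/* Hand-unrolled loop: four independent accumulators break the serial
   dependency on a single sum, so the compiler can map them to SIMD lanes
   without being allowed to reorder anything itself. */
float dot_unrolled(const float *a, const float *b, size_t n) {
    assert(a != NULL && b != NULL);
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i + 0] * b[i + 0];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    float sum = (s0 + s1) + (s2 + s3);
    /* Scalar tail for the remaining n % 4 elements. */
    for (; i < n; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

Both functions compute the same dot product; the second just spells out the reordering by hand, which is exactly the part that gets tedious to maintain across architectures.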
Ditto about the CUDA and cuDNN part.
My project, which had been running fine for the past 4 years, just "died" after a colleague's oversight when upgrading the GPU (1080Ti -> 3090), which isn't compatible with the new cuDNN.
It is just too much of a hassle maintaining that *expletive* jargon, so I made the wise decision to kill it.
For auto-differentiation, when I need performance or low memory use, I currently use Tapenade ( http://tapenade.inria.fr:8080/tapenade/index.jsp ) and/or manually written gradients when I need to fuse some kernels, but Enzyme ( https://enzyme.mit.edu/ ) is also very promising.
MPI for parallelization across machines.
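As an illustration of what a manually written, fused gradient can look like, here is a small C sketch (a toy example of my own, not output from Tapenade or Enzyme; the name `half_sse_and_grad` and the linear model are assumptions). It computes the loss L = 0.5 * sum_i (w*x_i + b - y_i)^2 and both partial derivatives in a single pass, so the residuals are never stored.

```c
#include <assert.h>
#include <stddef.h>

/* Fused forward + backward pass for a 1-D linear model.
   Returns the loss and writes dL/dw and dL/db through dw and db.
   Each residual is used for the loss and both gradients as soon as it
   is computed, so no intermediate array is needed. */
double half_sse_and_grad(const double *x, const double *y, size_t n,
                         double w, double b, double *dw, double *db) {
    assert(dw != NULL && db != NULL);
    double loss = 0.0, gw = 0.0, gb = 0.0;
    for (size_t i = 0; i < n; i++) {
        double r = w * x[i] + b - y[i]; /* residual */
        loss += 0.5 * r * r;            /* forward: loss contribution */
        gw += r * x[i];                 /* backward: dL/dw = sum r_i * x_i */
        gb += r;                        /* backward: dL/db = sum r_i */
    }
    *dw = gw;
    *db = gb;
    return loss;
}
```

A generic AD tool would typically record the residuals in a first sweep and read them back in a second one; fusing by hand trades that memory traffic for code you have to keep in sync with the forward pass yourself.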