Interesting, I didn't know this was so sought after.
I actually have one for sale (the 40GB PCIe one), but I haven't gotten around to listing it on eBay yet due to lack of time (and because I didn't think there was this much interest in it).
To be honest, maybe for DL this really is much better than the alternatives, but for some simulations and for parallelizing some radiative transfer code, it wasn't that much better than an RTX 4090, with the extra hassle that it's more difficult to cool.
The A100 is comparable to the 3090 but with more memory. The H100 is the one comparable to the 4090.
The advantage of these is access to the larger memory. They can also be linked together via NVLink such that they all share the same memory, which makes them scalable for processing large datasets and holding the weights of larger-scale LLMs and other NN/ML models.
>They can also be linked together via NVLink such that they all share the same memory, which makes them scalable for processing large datasets and holding the weights of larger-scale LLMs and other NN/ML models.
GPUs connected with NVLink do not exactly share memory. They don't look like a single logical GPU. One GPU can issue loads or stores to a different GPU's memory using "GPUDirect Peer-To-Peer", but you cannot have a single buffer or a single kernel that spans multiple GPUs. This is easier to use and more powerful than the previous system of explicit copies from device to device, perhaps, but a far cry from the way multiple CPU sockets "just work". Even if you could treat the system as one big GPU you wouldn't want to. The performance takes a serious hit if you constantly access off-device memory.
NVLink doesn't open up any functionality that isn't available over PCIe, as far as I know. It's "merely" a performance improvement. The peer-to-peer technology still works without NVLink.
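For what it's worth, here's a minimal CUDA sketch of what peer-to-peer means in practice (the kernel and buffer sizes are made up for illustration): after enabling peer access, a kernel launched on GPU 0 can dereference a pointer that was allocated on GPU 1, but that buffer still physically lives on GPU 1. The same code runs over plain PCIe; NVLink just makes the remote accesses faster.

```
#include <cstdio>
#include <cuda_runtime.h>

// Toy kernel running on GPU 0 that reads directly from a buffer
// resident in GPU 1's memory (peer access) and writes a local copy.
__global__ void read_peer(const float* peer_buf, float* local_buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) local_buf[i] = peer_buf[i] * 2.0f;  // this load crosses NVLink or PCIe
}

int main() {
    const int n = 1 << 20;
    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, /*device=*/0, /*peerDevice=*/1);
    if (!can_access) { printf("P2P not supported between GPU 0 and GPU 1\n"); return 1; }

    // Allocate a buffer on GPU 1; it stays there.
    float* buf_gpu1 = nullptr;
    cudaSetDevice(1);
    cudaMalloc(&buf_gpu1, n * sizeof(float));
    cudaMemset(buf_gpu1, 0, n * sizeof(float));

    // Enable peer access from GPU 0 to GPU 1, then launch a kernel on GPU 0
    // that dereferences the GPU 1 pointer directly -- no explicit cudaMemcpy.
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(/*peerDevice=*/1, 0);
    float* buf_gpu0 = nullptr;
    cudaMalloc(&buf_gpu0, n * sizeof(float));
    read_peer<<<(n + 255) / 256, 256>>>(buf_gpu1, buf_gpu0, n);
    cudaDeviceSynchronize();

    cudaFree(buf_gpu0);
    cudaSetDevice(1);
    cudaFree(buf_gpu1);
    return 0;
}
```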
NVidia's docs are, as always, confusing at best. There are several similarly-named technologies. The main documentation page just says "email us for more info". The best online documentation I've found is in some random slides.
Interesting. So that would mean you would still need a 40 or 80 GB card to run the larger models (30B, 70B, 8x7B LLMs) and to train them.
Or would it be possible to split the model layers between the cards, like you can between RAM and VRAM? I suppose in that case each card would evaluate the layers held in its own memory and then pass the results to the other card(s) as necessary.
You don't need NVLink for inference with models that need to be split across multiple cards. I'm using a laptop with a 3080 Ti mobile and a 3090 in an eGPU enclosure to run LLMs over 24GB.
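Roughly, the split works like the toy CUDA sketch below (the fake_layer kernel is just a made-up stand-in for real model layers): each card only ever touches weights and activations in its own memory, and the only cross-card traffic is the activation tensor copied between stages, which is why NVLink isn't required for this.

```
#include <cuda_runtime.h>

// Stand-in for "a layer held on this card": just adds a constant in place.
__global__ void fake_layer(float* x, int n, float w) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += w;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // "First half of the model" lives and runs on GPU 0.
    float* act0;
    cudaSetDevice(0);
    cudaMalloc(&act0, bytes);
    cudaMemset(act0, 0, bytes);
    fake_layer<<<(n + 255) / 256, 256>>>(act0, n, 1.0f);

    // Hand the activations to GPU 1. cudaMemcpyPeer works over plain PCIe;
    // NVLink only changes the bandwidth/latency of this hop.
    float* act1;
    cudaSetDevice(1);
    cudaMalloc(&act1, bytes);
    cudaMemcpyPeer(act1, /*dstDevice=*/1, act0, /*srcDevice=*/0, bytes);

    // "Second half of the model" runs on GPU 1.
    fake_layer<<<(n + 255) / 256, 256>>>(act1, n, 2.0f);
    cudaDeviceSynchronize();

    cudaFree(act1);
    cudaSetDevice(0);
    cudaFree(act0);
    return 0;
}
```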
As someone who used various types of GPUs in graduate school: for most simulations, and even machine learning (unless you need the VRAM), you are generally better off going with a consumer card. There are about the same number of CUDA cores, and the higher clock speeds will generally net you better performance overall.
The exception is simulations that need double-precision floating point (which you previously could get in the Titan series of consumer-ish cards). Where the datacenter cards are super important for DL is the VRAM, which lets you use much larger models, plus the added feature of being able to string them together and share memory, which has been left off consumer cards (honestly in a way that makes sense, because SLI has been dumb for some time).
How did you end up cooling it? I have an A40 and it's been interesting testing all kinds of methods, from two 40mm fans to a 3A 9030 centrifugal blower with a 3D-printed duct.