We don't use Kubernetes to run user workloads, but we do use gVisor. We don't use MIG (multi-instance GPU) or MPS. If you run a container on Modal with N GPUs, you get all N GPUs to yourself.
Re: not using Kubernetes, we have our own custom container runtime written in Rust, with optimizations like lazy loading of content-addressed file systems. https://www.youtube.com/watch?v=SlkEW4C2kd4
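If it helps to make that concrete, requesting N GPUs looks roughly like this (a minimal sketch assuming the current Modal Python client API; the app and function names are just placeholders):

```python
import modal

app = modal.App("gpu-demo")  # placeholder app name

# "A100:2" requests two whole A100s for this function; since GPUs aren't
# sliced with MIG or MPS, the container sees both physical devices.
@app.function(gpu="A100:2")
def show_gpus():
    import subprocess
    # List every device visible inside the gVisor-sandboxed container.
    print(subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True).stdout)

@app.local_entrypoint()
def main():
    show_gpus.remote()
```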
Yep! This is something we have internal tests for, haha. You have good instincts that it can be tricky. Here's an example of using that for multi-GPU training: https://modal.com/docs/examples/llm-finetuning
Okay, well, think very deeply about what you are saying about isolation, about the topology of the hardware, and about why NVIDIA does not allow P2P access even in vGPU settings except in specific circumstances that are not yours. I think if it were that easy to make the isolation promises you are making, NVIDIA would already do it. Malformed NVLink messages make GPUs fall off the bus even in trusted applications.
Suppose:
- You are using a container orchestrator like Kubernetes
- You are using gVisor as a container runtime
- Two applications from different users, containerized, are scheduled on the same node.
Then, which of the following are true?
(1) Both have shared access to an NVIDIA GPU
(2) Both share access to the NVIDIA GPU via CUDA MPS
(3) If there were 2 or more MIG instances on a node with a MIG-capable GPU, the NVIDIA container toolkit shim assigned a distinct MIG instance to each application
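You can actually check this empirically from inside each container. Here's a rough sketch (the MPS check via `CUDA_MPS_PIPE_DIRECTORY` is only a heuristic, not authoritative):

```python
import os
import subprocess

def visible_devices():
    # "nvidia-smi -L" prints one line per visible device: whole GPUs show a
    # GPU-... UUID, while MIG instances appear as "MIG ..." lines with MIG-... UUIDs.
    out = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True, check=True)
    return out.stdout.strip().splitlines()

def mps_hint():
    # Heuristic only: MPS clients are usually pointed at the control pipe via this
    # environment variable, or use the default /tmp/nvidia-mps directory.
    return os.environ.get("CUDA_MPS_PIPE_DIRECTORY") or (
        "/tmp/nvidia-mps" if os.path.isdir("/tmp/nvidia-mps") else None
    )

if __name__ == "__main__":
    for line in visible_devices():
        print(line)
    print("possible MPS pipe:", mps_hint())
```

Run it in both containers and compare: identical GPU UUIDs point at (1), a detected MPS pipe on top of that points at (2), and distinct MIG UUIDs point at (3).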