By NVIDIA's own numbers and widely available FP8 test results, the AMD MI355X just edges out the NVIDIA B300 (both are the top performers) at 10.1 PFLOPS per chip at around 1400 W per chip. Neither of these is available as a discrete device... you're going to be buying a system, and AMD Instinct systems typically run about 15% less than comparable NVIDIA ones.
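For a rough sense of what those headline figures work out to, here's a quick back-of-envelope sketch in Python using only the numbers above; the normalized price is just a stand-in for the ~15% system-price gap, not a quoted price:

```python
# Back-of-envelope efficiency math using only the figures quoted above;
# rough per-chip numbers, not measured system-level results.

fp8_pflops = 10.1        # MI355X FP8 throughput per chip (as quoted above)
board_power_w = 1400.0   # approximate power draw per chip (as quoted above)

pflops_per_kw = fp8_pflops / (board_power_w / 1000.0)
print(f"~{pflops_per_kw:.1f} PFLOPS of FP8 per kW")  # ~7.2

# If a comparable NVIDIA system is priced at 1.0 (normalized) and the AMD
# Instinct system runs ~15% less, equal raw throughput works out to roughly
# 1 / 0.85 ~= 1.18x the FP8 throughput per dollar.
relative_amd_price = 0.85
print(f"~{1 / relative_amd_price:.2f}x FP8 per dollar at equal throughput")
```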
There’s a difference between raw numbers on paper and actual real-world performance when training frontier models.
There’s a reason no frontier lab is using AMD GPUs for training: raw single-chip benchmarks for a single operation type don’t translate into performance over an actual full training run.
Meta, in particular, is heavily using AMD GPUs for inference, not training.
Also, anyone running very large models tends to prefer AMD, because the chips have 288 GB of memory each and outperform at that scale.
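To make the memory point concrete, here's a rough sketch of why 288 GB per chip matters at that scale; the 405B parameter count and the 80 GB comparison point are illustrative assumptions, not figures from the post above:

```python
# Rough weight-memory arithmetic for a very large model in FP8.
# Illustrative only: the 405B parameter count is an example, and this
# ignores activations, KV cache, optimizer state, and parallelism overhead.

import math

params = 405e9            # e.g. a 405B-parameter model (illustrative)
bytes_per_param = 1       # FP8 weights are 1 byte each

weight_gb = params * bytes_per_param / 1e9    # 405 GB of weights
chips_288gb = math.ceil(weight_gb / 288)      # 2 chips at 288 GB
chips_80gb = math.ceil(weight_gb / 80)        # 6 chips at 80 GB (H100-class)

print(f"Weights: {weight_gb:.0f} GB -> {chips_288gb} x 288GB chips "
      f"vs {chips_80gb} x 80GB chips")
```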
Outside of these use cases, it’s a toss-up.
AMD is also much more aligned with the supercomputing (HPC) world, where it is dominant (AMD CPUs and GPUs power around 140 of the top 500 HPC systems and 8 of the 10 most energy-efficient).
AMD literally can't make enough chips to satisfy demand because NVIDIA buys up all the fab capacity at TSMC.