It's been a while since I thought about this, but isn't the reason providers advertise only 3.2 Tbps because that's the limit of a single node's connection to the IB network? DGX is spec'ed to pair each H100 with a ConnectX-7 NIC, and those cap out at 400 Gbps. 8 GPUs * 400 Gbps per GPU = 3.2 Tbps.
Quiz 2 is confusingly worded but is, iiuc, referring to intranode GPU connections rather than internode networking.
I believe this is correct. For an H100 node, the 4 NVLink switches each have 64 ports supporting 25 GB/s each, and each GPU uses a total of 18 ports. This gives us 450 GB/s of bandwidth within the node. But once you start trying to leave the node, you're limited by the per-node InfiniBand cabling, which only gives you 400 GB/s out of the entire node (50 GB/s per GPU).
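Quick back-of-the-envelope in Python (port counts and NIC speeds as I remember the DGX H100 spec, so treat the numbers as approximate):

    # Intra-node: NVLink ports per GPU times per-port bandwidth
    nvlink_ports_per_gpu = 18
    gb_per_s_per_port = 25
    print(nvlink_ports_per_gpu * gb_per_s_per_port)  # 450 GB/s per GPU inside the node

    # Inter-node: one 400 Gbps ConnectX-7 NIC per GPU, 8 GPUs per node
    nic_gbps = 400
    gpus_per_node = 8
    node_tbps = nic_gbps * gpus_per_node / 1000
    print(node_tbps)  # 3.2 Tbps per node = 400 GB/s, i.e. 50 GB/s per GPU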
I had this problem and complained to AT&T support. I know it doesn't make sense to replace the modem because this is a software issue, not a hardware one, but they very quickly offered to replace the modem, and now my DNS isn't getting hijacked. Would recommend trying it!
Wonder if there's been a quiet hardware rev and they can't/don't want to update the firmware on the old units. I ran into that once on Spectrum - I bought the "same" modem to replace one, and suddenly my IPv6 config was borked.
Turns out Spectrum (in my region) actually pushes a directive in their config file to disable IPv6, even though their dual-stack network works great and has been working for at least 8 years now. Some modems apparently "override" that directive (i.e. ignore it and try to configure the IPv6 stack anyway) and you get fully functional IPv6 service. Other modems play goody-two-shoes and you're stuck with only IPv4. The new modem I bought was sold/marketed as the same model but was internally a totally different radio chipset. It was pulling a different firmware rev which had evidently been patched to actually obey the IP provisioning mode.
Spectrum support told me, basically, that if I have working IPv4 connectivity then my service is considered functional and there is nothing they can do. I gave up playing the support game and ended up exchanging modems until I landed on a Motorola one that gleefully ignores that config parameter.
I wish I knew how to actually state my case to someone at Spectrum with the authority to actually fix their busted provisioning profiles, because it's kind of crazy to me that they've basically bifurcated IPv6 in this market based on whether or not your modem feels like reading the whole config file ;-P. (What's even funnier is the modems they install must be spec non-compliant, because IPv6 at the office works fine and that's their leased equipment.)
I found that using "helm template" to render every Helm chart to YAML, and then using Pulumi to track changes and update my clusters (with Python transformation functions for per-cluster configuration), made my life so much better than using Helm directly. Watching Pulumi or Terraform watch Helm watch Kubernetes update a deployment felt pointlessly complicated.
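Roughly, the setup looks like this (a minimal sketch; the chart path, release name, and the nodeSelector tweak are made-up placeholders):

    import subprocess
    import pulumi_kubernetes as k8s

    # Render the chart to plain YAML ourselves instead of letting Helm own the release.
    rendered = subprocess.run(
        ["helm", "template", "my-release", "./charts/my-chart"],
        check=True, capture_output=True, text=True,
    ).stdout

    def per_cluster_tweaks(obj, opts):
        # Per-cluster transformation: e.g. pin Deployments to a particular node pool.
        if obj.get("kind") == "Deployment":
            pod_spec = obj["spec"]["template"]["spec"]
            pod_spec.setdefault("nodeSelector", {})["gpu-type"] = "h100"

    # Pulumi diffs and applies the rendered objects like any other resources.
    k8s.yaml.ConfigGroup(
        "my-chart",
        yaml=[rendered],
        transformations=[per_cluster_tweaks],
    )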
The main thing is that it's still kicking; Ksonnet is sadly dead.
In addition to that, it supports importing Helm charts, has a blessed convention for multiple environments, and ships several native Jsonnet functions that make things a bit nicer.
This is both a good thing and a bad thing, but Pulumi is way more flexible than Terraform. I wanted a cloud-provider-specific submodule that created resources (like EKS and GKE) and exported output values (think kubeconfig), and then I wanted the parent module to pass those in as inputs to a cloud-provider-independent submodule. Terraform couldn't do it without duplicating a ton of code, or without something heinous like Terragrunt (not sure that even would have worked). Pulumi makes it trivial, and in a language I like writing.
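The pattern is roughly this (a sketch, not our actual code; the component names and the placeholder kubeconfig are hypothetical):

    import pulumi

    class EksCluster(pulumi.ComponentResource):
        # Cloud-provider-specific layer: in reality this would create aws.eks.Cluster etc.
        def __init__(self, name, opts=None):
            super().__init__("pkg:cloud:EksCluster", name, None, opts)
            # Stand-in for the real rendered kubeconfig output.
            self.kubeconfig = pulumi.Output.secret("...kubeconfig yaml...")
            self.register_outputs({"kubeconfig": self.kubeconfig})

    class Workloads(pulumi.ComponentResource):
        # Cloud-provider-independent layer: only needs a kubeconfig as input.
        def __init__(self, name, kubeconfig, opts=None):
            super().__init__("pkg:apps:Workloads", name, None, opts)
            # e.g. build a pulumi_kubernetes.Provider from kubeconfig and deploy things with it.
            self.register_outputs({})

    cluster = EksCluster("prod-eks")
    Workloads("prod-apps", kubeconfig=cluster.kubeconfig)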
Additionally, our applications consume our cloud configuration (e.g. something that launches pods on heterogeneous GPUs needs to know which clusters support which GPUs; our colo cluster has H100s but our Google cluster has A100s, etc.). Writing in the same language in the same monorepo makes it very easy to share that state.
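Concretely, that shared state can just be a plain module imported by both the Pulumi program and the scheduling code (hypothetical names, but this is the shape of it):

    # clusters.py - imported by infrastructure code and by the pod launcher alike.
    CLUSTER_GPUS = {
        "colo-main": ["H100"],
        "gcp-us-central1": ["A100"],
    }

    def clusters_with(gpu_type):
        # Return the clusters that can run a pod requesting this GPU type.
        return [name for name, gpus in CLUSTER_GPUS.items() if gpu_type in gpus]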
No, it is not. That's the sparse FP8 FLOPS number, but you need to ignore sparsity and compare BF16 FLOPS, not FP8 FLOPS, for the comparison the ancestor post is making.
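As a rough sanity check (assuming the usual datasheet conventions, where structured sparsity doubles the headline number and FP8 is twice BF16 throughput):

    sparse_fp8_tflops = 3958              # H100 SXM headline figure, if I remember the datasheet right
    dense_fp8_tflops = sparse_fp8_tflops / 2
    dense_bf16_tflops = dense_fp8_tflops / 2
    print(dense_bf16_tflops)              # ~989 TFLOPS, the number to use for a BF16 comparison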
It's worth noting that just because an H100 has a higher FLOPS number doesn't mean your program is actually hitting that number. Modern TPUs are surprisingly competitive with Nvidia on a perf/$ basis; if you're doing cloud ML they are absolutely worth a look. We have been keeping costs down by racking our own GPUs, but TPUs are so cost effective that we need to do some thinking about changing our approach.
I'm not certain but I think part of this is that XLA (for example) is a mountain of chip-specific optimizations between your code and the actual operations. So comparing your throughput between GPU and TPU is not just flops-to-flops.
Great question! Reshape was definitely a source of inspiration, and in fact, our first PoC version was based on it.
We decided to start a new project for a couple of reasons. First, we preferred to have it in Go, so we can integrate it more easily into Xata. And we wanted to push it further, based on our experience, to also deal with constraints (in Reshape, constraints are shared between versions).
A friend and I have been working on this for quite a while. It's frustrating because it has come so far (showing trail data worldwide interactively on a tiny budget is challenging), and yet it is so far from being something super useful like AllTrails due to low data quality and a lack of relevant features (photos, reviews, etc.).
Ahh, this is a good idea; my friend has in fact been asking me to add it forever. I've been hesitating since the data seems so sparse, but I think you're both right. Thank you for the feedback!