aschleck's comments | Hacker News

Atherton closed their station in 2020 and, in a great twist of irony, it will become a museum for the line that's still alive next to it: https://inmenlo.com/2025/06/16/community-interest-meeting-on...


It's been a while since I thought about this, but isn't the reason providers advertise only 3.2 Tbps because that's the limit of a single node's connection to the IB network? DGX is spec'ed to pair each H100 with a ConnectX-7 NIC, and those cap out at 400 Gbps. 8 GPUs * 400 Gbps/GPU = 3.2 Tbps.

Quiz 2 is confusingly worded but is, iiuc, referring to intranode GPU connections rather than internode networking.
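As a back-of-the-envelope check on that arithmetic (a quick sketch; the 400 Gbps ConnectX-7 figure is from the DGX H100 spec, everything else is just unit conversion):

    # Per-node scale-out bandwidth for a DGX H100-style box:
    # 8 GPUs, each paired with one 400 Gbps ConnectX-7 NIC.
    gpus_per_node = 8
    nic_gbps = 400  # gigabits per second, per GPU

    node_gbps = gpus_per_node * nic_gbps          # 3200 Gbps = the advertised 3.2 Tbps
    node_gbytes = node_gbps / 8                   # ~400 gigabytes/s leaving the node
    per_gpu_gbytes = node_gbytes / gpus_per_node  # ~50 GB/s per GPU off-node

    print(f"{node_gbps / 1000:.1f} Tbps per node = {node_gbytes:.0f} GB/s "
          f"= {per_gpu_gbytes:.0f} GB/s per GPU")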


Try running a single-node all-to-all with SHARP disabled. I don't believe you'll see 450 GB/s.


Yes, 450 GB/s is the per-GPU bandwidth in the NVLink domain. 3.2 Tbps is the per-host bandwidth in the scale-out IB/Ethernet domain.


I believe this is correct. For an H100 system, the 4 NVLink switches each have 64 ports supporting 25 GB/s each, and each GPU uses a total of 18 ports. This gives us 450 GB/s of bandwidth within the node. But once you start trying to leave the node, you're limited by the per-node InfiniBand cabling, which only gives you 400 GB/s out of the entire node (50 GB/s per GPU).
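A tiny sketch contrasting the two numbers in this subthread (figures as stated above; purely illustrative):

    # Intranode: each H100 drives 18 NVLink ports at ~25 GB/s each.
    nvlink_ports_per_gpu = 18
    gbytes_per_port = 25
    intranode_gbytes_per_gpu = nvlink_ports_per_gpu * gbytes_per_port  # 450 GB/s

    # Internode: the whole node shares ~400 GB/s (3.2 Tbps) of InfiniBand,
    # i.e. ~50 GB/s per GPU once traffic has to leave the box.
    internode_gbytes_per_gpu = 400 / 8

    print(intranode_gbytes_per_gpu, internode_gbytes_per_gpu)   # 450 vs 50.0
    print(intranode_gbytes_per_gpu / internode_gbytes_per_gpu)  # ~9x gap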


Is it GBps (gigabytes per second) or Gbps (gigabits per second)? I see mixed usage in this comment thread, so I'm left wondering which it actually is.

The article is consistent and uses gigabytes.


GBps


Bazel makes easy things hard and impossible things possible


Perl had a similar motto ("Making Easy Things Easy and Hard Things Possible")

https://books.google.com.sb/books?id=3Fc4DQAAQBAJ


I had this problem and complained to AT&T support. I know it doesn't make sense to replace the modem, because this is a software issue rather than a hardware one, but they very quickly offered to replace it, and now my DNS isn't getting hijacked. Would recommend trying it!


Wonder if there's been a quiet hardware rev and they can't/don't want to update the firmware on the old units. I ran into that once on Spectrum: I bought the "same" modem to replace one, and suddenly my IPv6 config was borked.

Turns out Spectrum (in my region) actually pushes a directive in their config file to disable IPv6, even though their dual-stack network works great and has been working for at least 8 years now. Some modems apparently "override" that directive (i.e. ignore it and try to configure the IPv6 stack anyway) and you get fully functional IPv6 service. Other modems play goody-two-shoes and you're stuck with only IPv4. The new modem I bought was sold/marketed as the same model but internally had a totally different radio chipset. It was pulling a different firmware rev, which had evidently been patched to actually obey the IP provisioning mode.

Spectrum support told me, basically, that if I have working IPv4 connectivity then my service is considered functional and there is nothing they can do. I gave up playing the support game and ended up exchanging modems until I landed on a Motorola that gleefully ignores that config parameter.

I wish I knew how to state my case to someone at Spectrum with the authority to actually fix their busted provisioning profiles, because it's kind of crazy to me that they've basically bifurcated IPv6 in this market based on whether or not your modem feels like reading the whole config file ;-P. (What's even funnier is that the modems they install must be spec non-compliant, because IPv6 at the office works fine and that's their leased equipment.)


What modem did they replace it with?


I found that using "helm template" to convert every Helm chart into YAML, and then using Pulumi to track changes and update my clusters (with Python transformation functions for per-cluster configuration), made my life so much better than using Helm. Watching Pulumi or Terraform watch Helm watch Kubernetes update a deployment felt pointlessly complicated.
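For illustration, a minimal sketch of that setup with pulumi_kubernetes, assuming the charts have already been rendered into a manifests/ directory with `helm template` (the names, paths, and transformation below are made up):

    import pulumi_kubernetes as k8s

    # Rendered ahead of time, outside Pulumi:
    #   helm template my-release ./chart > manifests/my-release.yaml

    def per_cluster_tweaks(obj, opts):
        """Transformation applied to every rendered object (hypothetical example)."""
        if obj.get("kind") == "Deployment":
            # e.g. vary replica counts per cluster instead of templating Helm values
            obj["spec"]["replicas"] = 3

    rendered = k8s.yaml.ConfigGroup(
        "my-release",
        files=["manifests/*.yaml"],            # output of `helm template`
        transformations=[per_cluster_tweaks],  # plain Python instead of values.yaml
    )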


I do the same with Tanka + Jsonnet; it's definitely a million times better than dealing with Helm itself or, god forbid, letting it apply manifests.


The main thing is that it's still kicking; Ksonnet is sadly dead.

In addition to that, it supports importing Helm charts, has a blessed convention for multiple environments, and provides several native Jsonnet functions that make things a bit nicer.


Interesting, thanks.


What about using Terraform instead of Pulumi? Why did you pick Pulumi for this?


This is both a good thing and a bad thing, but Pulumi is way more flexible than Terraform. I wanted to have a cloud-provider-specific submodule that created resources (like EKS and GKE clusters) and exported output values (think kubeconfig), and then I wanted the parent module to pass those in as inputs to a cloud-provider-independent submodule. Terraform couldn't do it without duplicating a ton of code, or without something heinous like Terragrunt (and I'm not sure that would even have worked). Pulumi makes it trivial, and in a language I like writing.

Additionally, our applications consume our cloud configuration (e.g. something that launches pods on heterogeneous GPUs needs to know which clusters support which GPUs; our colo cluster has H100s but our Google cluster has A100s, etc.). Writing in the same language in the same monorepo makes it very easy to share that state.
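A rough sketch of that layering in Pulumi's Python SDK (the function names and stub bodies are made up; the point is just that kubeconfig outputs from the provider-specific layer feed a provider-independent layer):

    import pulumi
    import pulumi_kubernetes as k8s

    # Provider-specific layer: in real code these would create GKE/EKS resources
    # and return the generated kubeconfig; stubbed here for illustration.
    def make_gke_cluster(name: str) -> pulumi.Output:
        return pulumi.Output.from_input("<kubeconfig from the GKE submodule>")

    def make_eks_cluster(name: str) -> pulumi.Output:
        return pulumi.Output.from_input("<kubeconfig from the EKS submodule>")

    # Provider-independent layer: only needs a kubeconfig plus a little metadata.
    def deploy_workloads(name: str, kubeconfig: pulumi.Output, gpu_type: str) -> None:
        provider = k8s.Provider(f"{name}-k8s", kubeconfig=kubeconfig)
        k8s.core.v1.Namespace(
            f"{name}-workloads",
            opts=pulumi.ResourceOptions(provider=provider),
        )
        # ...Deployments/Jobs would use gpu_type to set node selectors, etc.

    deploy_workloads("google", make_gke_cluster("google"), gpu_type="a100")
    deploy_workloads("aws", make_eks_cluster("aws"), gpu_type="h100")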


With Pulumi you really are programming infrastructure in a language of your choice. HCL is a bad joke in comparison.


I am hearing this more and more from folks.


The 1979 Tflop/s 16-bit number for an H100 is with sparsity. See footnote 2 on https://www.nvidia.com/en-us/data-center/h100/. You should be halving it for non-sparse flops.


GP is correct. With sparsity it is 3958. 1979 Tflop/s is without sparsity.


No, it is not. That's the sparse fp8 flop number, but you need to ignore sparsity and compare bf16 flops, not fp8 flops, for the comparison the ancestor post is making.
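For reference, the datasheet figures as I remember them (treat the exact numbers as approximate; structured sparsity doubles each dense figure, which is what makes 1979 ambiguous):

    # Approximate H100 SXM Tensor Core peaks in TFLOP/s, from memory of the
    # NVIDIA datasheet -- illustrative only.
    h100_tflops = {
        #              dense    with 2:4 sparsity
        "bf16/fp16":  (989.4,   1979),
        "fp8":        (1979,    3958),
    }

    # 1979 is both the *sparse* bf16 number and the *dense* fp8 number, hence
    # the confusion upthread. A bf16, no-sparsity comparison should use ~989.
    print(h100_tflops["bf16/fp16"][0])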


It's worth noting that just because an H100 has a higher flops number doesn't mean your program is actually hitting that number of flops. Modern TPUs are surprisingly competitive with Nvidia on a perf/$ basis; if you're doing cloud ML they are absolutely worth a look. We have been keeping costs down by racking our own GPUs, but TPUs are so cost effective that we need to do some thinking about changing our approach.

I'm not certain, but I think part of this is that XLA (for example) is a mountain of chip-specific optimizations sitting between your code and the actual operations, so comparing your throughput between GPU and TPU is not just a flops-to-flops comparison.


Cool stuff! Do you have any thoughts about how this compares to https://github.com/fabianlindfors/reshape?


Great question! Reshape was definitely a source of inspiration, and in fact, our first PoC version was based on it.

We decided to start a new project for a couple of reasons. First, we preferred to have it in Go, so we could integrate it more easily into Xata. And we wanted to push it further, based on our experience, to also deal with constraints (with Reshape, constraints are shared between versions).


Luiz called large meetings "all-heads" instead of "all-hands" meetings. "Why? Because we're not sailors!"

I never got to interact with him directly but he was a leader worth following. This is a sad day.


A website making trail data from OpenStreetMap easily browsable: https://trailcatalog.org/

A friend and I have been working on this for quite a while. It's frustrating because it has come so far (showing trail data worldwide interactively on a tiny budget is challenging), and yet it is so far from being something super useful like AllTrails due to low data quality and a lack of relevant features (photos, reviews, etc.).


Have you looked into following the links from OSM elements to Wikidata items and their associated Wikipedia/WikiVoyage articles, Commons images etc.?


Ahh, this is a good idea; my friend has in fact been asking me to add it forever. I've been hesitating since the data seems so sparse, but I think you're both right. Thank you for the feedback!

