Newer airports usually try to leave extra space at the ends; that's the only thing that really helps with the physics involved here.
Older airports might have EMAS [1] retrofitted at the ends to help stop planes, but that's designed more for a landing plane not stopping quickly enough (like [2]) - not a plane trying to get airborne as in this case.
"Aurora" at Argonne National Labs is intended to be a bit bigger, but has suffered through a long series of delays. It's expected to surpass Frontier on the TOP500 list this fall once they some issues resolved. El Capitan at LLNL is also expected to be online soon, although I'm not sure if it'll be on the list this fall or next spring.
As others note, these systems are measured by running a specific benchmark - Linpack - and require the machine to be formally submitted. There are systems in China that are on a similar scale, but, for political reasons, have not formally submitted results. There are also always rumors around the scale of classified systems owned by various countries that are also not publicized.
Alongside that, the hyperscale cloud industry has added some wrinkles to how these are tracked and managed. Microsoft occupies the third position with "Eagle", which I believe is one of their newer datacenter deployments briefly repurposed to run Linpack. And they're rolling similar scale systems out on a frequent basis.
At the node level, these systems usually aim for around 90-95% allocated. Compared to most "cloud" applications, hitting that usually involves a number of tricks at the system-scheduling level.
At some point, in order to concurrently allocate a 1000-node job, all 1000 nodes will need to be briefly unoccupied ahead of that, and that can introduce some unavoidable gaps in system usage. Tuning the "backfill" part of the workload manager's scheduler can help reduce that, and a healthy mix of small single-node short-duration work alongside bigger multi-day, multi-thousand-node jobs helps keep the machine busy.
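To make the backfill idea concrete, here's a minimal sketch in Go (not Slurm's actual implementation, just the core rule it applies): a smaller job is allowed to jump ahead only if it fits on currently idle nodes and can finish before the reservation held for the big job begins.

    package main

    import (
        "fmt"
        "time"
    )

    // canBackfill reports whether a small job can start now without delaying
    // the reservation held for a larger pending job: it must fit on the idle
    // nodes and its requested walltime must end before that reservation starts.
    func canBackfill(now, bigJobStart time.Time, walltime time.Duration, idleNodes, wantNodes int) bool {
        return wantNodes <= idleNodes && !now.Add(walltime).After(bigJobStart)
    }

    func main() {
        now := time.Now()
        bigJobStart := now.Add(6 * time.Hour) // a 1000-node job is reserved to start in 6 hours
        fmt.Println(canBackfill(now, bigJobStart, 4*time.Hour, 50, 1)) // true: fits in the gap
        fmt.Println(canBackfill(now, bigJobStart, 8*time.Hour, 50, 1)) // false: would delay the big job
    }

That's also why the mix of short jobs matters: they're the only thing that can slot into those gaps without pushing the big reservation back.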
Almost no modern systems use a torus topology these days - at least not at the node level. The backbone links are still occasionally designed that way, although Dragonfly+ or similar is much more common and maps better onto modern switch silicon.
You're spot on that the bandwidth available in these machines hugely outstrips that in common cloud cluster rack-scale designs. Although full bisection bandwidth hasn't been a design goal for larger systems for a number of years.
LambdaLabs' GPU clusters provide internode bandwidth of 3.2 Tbps: I personally verified it on a 64-node cluster (8xH100 servers), and they claim it holds for clusters of up to 5k GPUs. What is the internode bandwidth of Frontier? Someone claimed it's 200 Gbps, which, if true, would be a huge bottleneck for some ML models.
Bisection bandwidth is the metric these systems will cite, and impacts how the largest simulations will behave. Inter-node bandwidth isn't a direct comparison, and can be higher at modest node counts as long as you're within a single switch. I haven't seen a network diagram for LambdaLabs, but it looks like they're building off 200Gbps Infiniband once you get outside of NVLink. So they'll have higher bandwidth within each NVLink island, but the performance will drop once you need to cross islands.
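For a rough sense of why the two numbers aren't comparable, here's a back-of-the-envelope Go sketch with hypothetical leaf/spine numbers (illustrative only - not LambdaLabs' or Frontier's actual topology): injection bandwidth is what each node can push into its first switch, while bisection bandwidth is what survives when you cut the fabric in half, and uplink tapering shrinks the latter.

    package main

    import "fmt"

    func main() {
        // Hypothetical two-level leaf/spine fabric; real Dragonfly/Slingshot and
        // rail-optimized InfiniBand builds are more complicated than this.
        const (
            nodes          = 64     // compute nodes
            nicGbpsPerNode = 3200.0 // e.g. 8 NICs x 400 Gbps per node
            leaves         = 8      // leaf switches, 8 nodes each
            uplinksPerLeaf = 16     // uplinks from each leaf toward the spine
            uplinkGbps     = 400.0
        )
        injection := nodes * nicGbpsPerNode                        // total injection bandwidth
        bisection := float64(leaves/2*uplinksPerLeaf) * uplinkGbps // capacity crossing the midpoint cut
        fmt.Printf("injection: %.0f Gbps, bisection: %.0f Gbps, taper: %.1f:1\n",
            injection, bisection, injection/2/bisection)
    }

Within a single leaf (or NVLink island) you see the full injection number; once traffic has to cross the cut, the taper is what actually limits you.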
I thought NVLink was only for communication between GPUs within a single node, no? I don't know how big their switches are, but I verified that within a 64-node cluster I got the full advertised 3.2 Tbps bandwidth. So that's 4x as fast as 4x200Gbps, but 800 Gbps is probably not a bottleneck for any real-world workload.
Frontier runs unclassified workloads. Other Department of Energy systems, such as the upcoming "El Capitan" at LLNL (a sibling to Frontier, procured under the same contract) are used for classified work.
I know Golang has its own implementation of sd_notify().
For Slurm, I looked at what a PITA it would be to pull libsystemd into our autoconf tooling, stumbled on the Golang implementation, and realized it's trivial to implement directly.
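For anyone curious what "trivial" means here: the notify protocol is just a datagram write of a short string like READY=1 to the AF_UNIX socket named in $NOTIFY_SOCKET. A minimal Go sketch of that protocol (not Slurm's actual C code) looks roughly like:

    package main

    import (
        "net"
        "os"
    )

    // notify sends a state string such as "READY=1" or "STOPPING=1" to the
    // socket systemd passes in $NOTIFY_SOCKET. If the variable is unset we're
    // not running under Type=notify and silently do nothing. A leading '@' in
    // the path denotes an abstract socket, which Go's net package understands.
    func notify(state string) error {
        path := os.Getenv("NOTIFY_SOCKET")
        if path == "" {
            return nil
        }
        conn, err := net.DialUnix("unixgram", nil, &net.UnixAddr{Name: path, Net: "unixgram"})
        if err != nil {
            return err
        }
        defer conn.Close()
        _, err = conn.Write([]byte(state))
        return err
    }

    func main() {
        _ = notify("READY=1")
    }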
> I suppose that you could execute systemd-notify but that's a solution that I would not like.
What I did was to use JNA to call sd_notify() in libsystemd.so.0 (when that library exists), which works but obviously does not avoid using libsystemd. I suppose I could have done all the socket calls into glibc by hand, but doing that single call into libsystemd directly was simpler (and it can be expected to exist whenever systemd is being used).
Caveat is that golang is not a good enough actor to be a reliable indicator of whether this interface is supported, though. They’ll go to the metal because they can, not because it’s stable.
Cluster authentication actually remains the sore point for all interactions - for most systems the lowest common denominator is still whether one can SSH into a login node. At that point the main job commands - sbatch/squeue - are pretty stable.
There have been some attempts to standardize basic job management APIs in the past - DRMAA being one noteworthy example. Although DRMAA v2 was only ever implemented by Grid Engine, is effectively a lightly-abstracted version of their internal APIs, and never really saw first-class adoption by Slurm/PBS/LSF.
For Slurm, the REST API is meant to be the way forward. It punts the authentication problem to, potentially, anything the admins care to wire up through an Apache / NGINX proxy. And the basic job submission and status APIs have stabilized to the point that a client application should be able to consume nearly any version going forward.
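As a rough sketch of what consuming it can look like (assumptions: slurmrestd is exposed behind a proxy at a made-up URL, JWT auth is configured with the token in $SLURM_JWT, and the version prefix in the path matches your release - all of that is site- and version-dependent):

    package main

    import (
        "fmt"
        "io"
        "net/http"
        "os"
    )

    func main() {
        // Hypothetical deployment details: base URL and API version prefix
        // depend entirely on how the site has wired things up.
        base := "https://hpc.example.org/slurmrestd"
        req, err := http.NewRequest("GET", base+"/slurm/v0.0.39/jobs", nil)
        if err != nil {
            panic(err)
        }
        req.Header.Set("X-SLURM-USER-NAME", os.Getenv("USER"))
        req.Header.Set("X-SLURM-USER-TOKEN", os.Getenv("SLURM_JWT"))

        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()

        body, _ := io.ReadAll(resp.Body)
        fmt.Println(resp.Status)
        fmt.Println(string(body)) // JSON listing of jobs
    }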
Slurm is endian- and word-size-agnostic; there's nothing about the POWER9 platform that would prevent it from running there. There are occasionally some teething issues with NUMA layout and other Linux kernel differences on newer platforms, but that tends to affect everyone and gets resolved quickly.
My understanding is that Spectrum (formerly Platform) LSF was included as part of their proposal.
[1] https://en.wikipedia.org/wiki/Engineered_materials_arrestor_... [2] https://en.wikipedia.org/wiki/Southwest_Airlines_Flight_1248