yeah, misunderstanding, we'll update the post -- separately, it's true that we aren't network specialists and the network wrangling was prob disproportionately hard for us / shouldn't have taken so long.
Massive props for getting it done anyway. For others reading: in general a switch should never run dhcpd itself, but it will normally/often relay DHCP for you -- your Aristas would 100% have supported relaying, though in this case it sounds like it might even be flat L2. Normally you'd host dhcpd on a server.
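For reference, if the switch is routing the VLAN, relaying on an EOS-style box is typically just a helper address on the routed interface pointing at wherever dhcpd lives (a sketch; the VLAN and addresses are made up, and on a flat L2 network you don't even need this):

```
! sketch: forward DHCP broadcasts on the server VLAN to a dhcpd host at 10.0.0.2
interface Vlan100
   ip address 10.0.0.1/24
   ip helper-address 10.0.0.2
```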
Some general feedback in case it's helpful...
-20K on contractors seems insane if we're talking about rack and stack for 10 racks. Many datacentres can be persuaded to do it for free as part of you agreeing to sign their contract. Your contractors should at least be using a server lift of some kind, again often kindly provided by the facility. If this included paying for server configuration and so on, then ignore that comment (bargain!).
-I would almost never expect to actually pay a setup fee (beyond something nominal like 500 per rack) to the datacentre either; certainly if you're going to be paying that fee it had better include rack and stack.
-A crash cart should not be used for an install of this size; the servers should be plugged into the network and then automatically configured by a script/iPXE. It might sound intimidating or hard but it's not, and it doesn't even require IPMI (though frankly I would strongly, strongly recommend IPMI if you don't already have it). I would use managed switches for the management network too, for sure.
-Consider two switches, especially if they are second hand. Even here, the cost of the cluster not being usable for a few days while you source and install a replacement is probably still thousands.
-Personally I'm not a big fan of the whole JBOD architecture and would have just filled my boots with single-socket 4U Supermicro chassis. To each their own, but JBOD's main benefit is a very small financial saving at the cost of quite a lot of drawbacks IMO. YMMV.
-Depending on who you use for GPUs, getting a private link or 'peering' to them might save you some cost and provide higher capacity.
-I'm kind of shocked that FMT2 didn't turn out much cheaper than your current colo; I'd expect less than those figures, possibly with the 100G DIA included (normally about $3000/month, no setup fee).
def agree on the setup fees, that was just a price crunch to get it done within the weekend. (too short-notice for professional services, too sensitive for craigslist, so basically just paying a bunch of folks we already knew and trusted)
for IPXE do you have any reference material you'd recommend? we had 3 people each with reasonably substantial server experience try for like 6 hours each and for whatever reason it turned out to be too difficult.
I have done a ton of iPXE boot setups in the past. We use iPXE at our DC location for imaging, system recovery, etc. In fact, I just finished up a new boot image that creates a 100MB virtual floppy drive used for BIOS updates. Reach out and I can provide the entire setup if you like (pxe config files, boot loaders, scripts, etc).
Similarly I'm happy to share my ipxe scripts. It's just one of those things that you need to understand the fundamentals of before you start. It's about a hundred lines of bash to set up.
Honestly, with 10 servers, a pxe setup is probably overkill. If you're getting used servers (and maybe even if not), you might need to poke them with a KVM to set the boot options so that PXE is an option, and you might want to configure the BMC/IPMI from the console too, and then configure anything for serial over IPMI / bios console on serial ports... do that in your office, since your colo is across the street, and then you may as well do the install too. Then when you install them, it should just work, and you can crash cart if not. But, PXE is fun, so...
For PXE / iPXE, there are several stages of boot. You have your NIC's option ROM, which might be, but probably is not, iPXE. That will hit DHCP to get its own IP and also request info about where to pull boot files. You'll need to give it a tftp server IP and a filename. DHCPD config below:
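Something along these lines (a minimal sketch; the subnet, addresses and filenames are placeholders, and the tftp root doubles as the http docroot so the same three files cover everything):

```
# /etc/dhcp/dhcpd.conf -- chainload stock PXE ROMs into iPXE, then hand iPXE a boot script
option client-arch code 93 = unsigned integer 16;

subnet 10.0.0.0 netmask 255.255.255.0 {
  range 10.0.0.100 10.0.0.200;
  option routers 10.0.0.1;
  next-server 10.0.0.2;                       # tftp server

  if exists user-class and option user-class = "iPXE" {
    # already running iPXE: give it the boot script over http
    filename "http://10.0.0.2/boot.ipxe";
  } elsif option client-arch != 00:00 {
    # UEFI option ROM: chainload the UEFI iPXE binary
    filename "ipxe.efi";
  } else {
    # legacy BIOS option ROM: chainload the BIOS iPXE binary
    filename "undionly.kpxe";
  }
}
```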
I serve iPXE executables to non-iPXE clients. When iPXE starts up, it asks DHCP again, but now you can give it an http boot script. The simplest thing is to have something like:
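(a minimal sketch; the server IP, kernel/initrd names and console arguments are placeholders for whatever you're actually serving)

```
#!ipxe
# boot.ipxe -- fetch a kernel and initrd over http and boot them
kernel http://10.0.0.2/vmlinuz initrd=initrd.img ip=dhcp console=ttyS1,115200
initrd http://10.0.0.2/initrd.img
boot
```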
You can also boot ISOs, but that's a lot easier if you're in BIOS boot rather than UEFI. Better to practice booting kernels and initrds (unless you need to boot things like firmware update ISOs).
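In iPXE that's basically a one-liner with sanboot, at least under legacy BIOS (sketch; the URL is a placeholder):

```
#!ipxe
# expose the ISO as an emulated SAN drive and boot it -- reliable under BIOS, hit-or-miss under UEFI
sanboot http://10.0.0.2/images/firmware-update.iso
```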
Then you'll have your installer (or whatever) booted, and you might have an unattended install set up for that, or you can just set up a rescue image that does dhcp (again!) and opens sshd so you can shell in and do whatever. Up to you.
(This is mostly consolidating bits and pieces from here [1].)
And I have those three files in the root of my tftp server. There's all sorts of other stuff you could do, but this should get you started. You don't really need iPXE either, but it's a lot more flexible if you need anything more, and it can load from http which is gobs faster if you have large payloads.
If you really wanted to be highly automated, your image could pull in config from some system and reconfigure the BMC while it was there. But there's no need for that unless you've got tons of servers. Might be something to consider if you mass-replace your disk shelves with 4U disk servers, although it might not save a ton of time. If you're super fancy, your colo network would have different vlans and one of them would be the pxe setup vlan -- new servers, or servers needing reimaging, could be put into the pxe vlan and the setup script could move them into the prod vlan when they're done. That's fun work, but not really needed, IMHO. Semi-automated setup scales a lot farther than people realize, a couple hundred servers at least. autopw [2] can help a lot!
I assume your actual training is being done somewhere else? Did you try getting colocation space in the same datacentre as the compute - it would have reduced your internet costs even further.
yeah the cost calculus is very different for gpus, it absolutely makes sense for us to be using cloud there. also hardly any datacenters can support the power density, esp in downtown sf
Yeh; one other thing - you list a separate management network as optional - it's not optional! Under no circumstances should you expose the management IPs of switches or the servers to the internet; they are, on average, about as secure as a drunk politician. Use a separate management net and make sure it's only securely accessed.
I understood that it's optional because they can walk down the road to the data center instead.
They mention plugging monitors in several times. I think I've only done that once in the last couple of years, when a firmware upgrade failed and reset the management interface IP.
egress costs are the crux for AWS and they didn't budge when we tried to negotiate that with them; it's just entirely unusable for AI training otherwise. I think the cloudflare private quote is pretty representative of the cheaper end of managed object-bucket storage.
obv as we took on this project the delta between our cluster and the next-best option got smaller, in part bc the ability to host it ourselves gives us negotiating leverage, but managed bucket products are fundamentally overspecced for simple pretraining dumps. glacier does a nice job fitting the needs of archival storage for a good cost, but there's nothing similar for ML needs atm.
yeah it's totally plausible that we go with something like this in the future. We have similar offers where we could separate out either the financing, the build-out, or both and just do the software.
(for Hetzner in particular it was a massive pain when we were trying to get CPU quotas with them for other data operations, and we prob don't want to have it in Europe, but it's been pretty easy to negotiate good quotes on similar deals locally now that we've shown we can do it ourselves)
atm we don't and we're a bit unsure whether it's a free lunch wrt adding complexity. there's a really nice property of having isolated hard drives where you can take any individual one and `sudo mount` it and you have a nice chunk of training data, and that's something anyone can feel comfortable touching without any onboarding to some software stack
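(concretely, something like the below, with the device name being whatever the drive shows up as)

```
# any machine, no special software: plug a drive in and mount it read-only
sudo mkdir -p /mnt/shard
sudo mount -o ro /dev/sdb1 /mnt/shard
ls /mnt/shard   # ~20TB of pretraining data as plain files
```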
I wonder if snapraid would work for this. Especially if your data is mostly written once and then just read, it could be an easy way to add redundancy while keeping isolated individual drives.
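Roughly, a snapraid.conf along these lines adds a parity drive on top while every data drive stays a plain, individually-mountable filesystem (a sketch; the paths are placeholders):

```
# snapraid.conf -- one parity drive protecting independent data drives
parity /mnt/parity1/snapraid.parity
content /var/snapraid/snapraid.content
content /mnt/data1/snapraid.content
data d1 /mnt/data1
data d2 /mnt/data2
data d3 /mnt/data3
```

You'd run `snapraid sync` after each batch of writes and an occasional `snapraid scrub`; the data drives themselves stay whatever filesystem you already use, so you can still pull any one of them and `sudo mount` it on its own.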
we're in a pretty unique situation in that very early on we fundamentally can't afford the hyperscaler clouds to cover operations, so we're forced to develop some expertise. turned out to be reasonably chill and we'll prob stick with it for the foreseeable future, but we have seen a little bit of the state-creep you mention so tbd.
yeah we're very interested in trying toploaders, we'll do a test rack next time we expand and switch to that if it goes well.
w.r.t. testing the main thing we did was try to buy a bit from each supplier a month or two ahead of time, so by the time we were doing the full build that rack was a known variable. We did find one drive lot which was super sketchy and just didn't include it in the bulk orders later. diversity in suppliers helps a lot with tail risk
"don't have to screw in every drive" is relative, but at least tool-less drive carriers are a thing now.
A lot of older toploaders from vendors like Dell are not tool-free. If you bought vendor drives and one fails, you RMA it and move on. However, if you want to replace failed drives in the field, or want to go it alone from the start with refurbished drives... you'll be doing a lot of screwing. The carriers are quite fragile and the plastic snaps easily. It's pretty tedious work.
yeah colo help has been great, we had a power blip and without any hassle they covered the cost and installation of UPSes for every rack, without us needing to think abt it outside of some email coordination.
not caring about redundancy/reliability is really nice, each healthy HDD is just the same +20TB of pretraining data and every drive lost is the same marginal cost.