ML workloads cost a lot of money. Even as a preemptible VM, an A100 runs $0.88/hr/GPU. That's roughly $630 a month for a single GPU, and that's only the 40GB model. Want a dedicated 8-GPU machine in the cloud to train on? That'll run you around 16 grand a month. Do that for two years and you may as well have bought the hardware. Want to do 16/24/40-GPU training? Good luck getting dedicated cloud machines with networking between them fast enough for MPI to work properly, and prepare to empty your wallet.
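A quick back-of-the-envelope sketch of that arithmetic (the $0.88/hr preemptible rate and the ~$16k/month dedicated figure are from above; the hours-per-month and the on-prem purchase price are assumptions plugged in for illustration):

```python
# Back-of-the-envelope rent-vs-buy arithmetic for the figures above.
# The hourly rate and the 8-GPU monthly figure come from the comment;
# HOURS_PER_MONTH and the on-prem purchase price are assumptions.

PREEMPTIBLE_A100_PER_HOUR = 0.88   # $/hr/GPU, 40GB model, preemptible
HOURS_PER_MONTH = 720              # ~24h * 30 days; exact totals vary

single_gpu_month = PREEMPTIBLE_A100_PER_HOUR * HOURS_PER_MONTH
print(f"1x A100 preemptible, 24/7: ~${single_gpu_month:,.0f}/month")

DEDICATED_8GPU_PER_MONTH = 16_000  # rough dedicated-instance figure from the comment
two_years_rented = DEDICATED_8GPU_PER_MONTH * 24
print(f"8-GPU dedicated for 2 years: ~${two_years_rented:,.0f}")

# Hypothetical purchase price for a comparable 8x A100 server (assumption).
ON_PREM_8GPU_SERVER = 150_000
print(f"Rented cost / purchase price: {two_years_rented / ON_PREM_8GPU_SERVER:.1f}x")
```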
Also, that's just compute. What about data? Sure, the cloud takes your data in cheaply, but it also charges you for egress of that data. Yes, you should have your data in more than one location, but if you depend solely on the cloud then you need it in a different AZ, which costs even more money to keep in sync and available for training runs.
I think for simple workloads and renting compute for a startup, cloud definitely makes sense. But the moment you try to do some serious compute for ML workloads, good luck and hope you have deep pockets.
The other thing is that Nvidia sells GPUs with similar performance at two very different prices: one price for data centres and a quite different price for gamers. If you do the job yourself you can often get away with using the much cheaper gamer-grade cards for AI work (unless you need a lot of VRAM), whereas providers such as AWS can't do that and are required by Nvidia to use the considerably more expensive cards. If your workload fits on a gamer-grade card, there's no contest on price between an on-prem system and the cloud.
That is a really good point, and the 3090s have a surprising amount of VRAM on them. For many smaller models this is sufficient. However, where I work (without going into a lot of specifics), the models are big enough that the amount of VRAM is crucial, as are the PCIe lanes feeding the cards, the speed of the local storage, and the networking both between cards on the same node and between nodes.
The moment the model gets bigger than any one GPU's VRAM, training it becomes orders of magnitude more difficult.
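A rough sketch of why the 24GB ceiling arrives quickly: assuming mixed-precision Adam at ~16 bytes of state per parameter and ignoring activations entirely (so this is a floor, not a real footprint), the training state alone outgrows a single consumer card fast:

```python
# Rough floor on training memory for a dense model, assuming mixed-precision
# Adam: fp16 weights + fp16 grads + fp32 master weights + two fp32 moments
# (~16 bytes/param). Activation memory is ignored, so real usage is higher.

def training_state_gb(params_billion: float) -> float:
    bytes_per_param = 2 + 2 + 4 + 4 + 4
    return params_billion * 1e9 * bytes_per_param / 1024**3

for size in (1.5, 6, 13, 30):
    need = training_state_gb(size)
    verdict = "state fits on a 24GB 3090" if need <= 24 else "needs model/optimizer sharding"
    print(f"{size:>4}B params: ~{need:,.0f} GB of state -> {verdict}")
```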
Re data, I think egress rates are going to start disappearing over the next few years.
The part that's always missing from these rent-vs-buy analyses on HN, for some reason, is that they totally ignore the opex cost of operating your own hardware, which is going to be non-zero. Sure, it won't be quite as expensive (no profit margin), but the difference isn't an order of magnitude. Additionally, most companies don't run the HW 24/7, and if they do, it's not the kind of staff they want to hire to support those operations. It's not just running it: you have to invest in and grow something that's not a core competency to get the economies of multiple teams loading up the HW.
If the next revolution in cloud causes companies to bring the HW on-site again, it'll look like making it super easy to take spare compute and spare storage from existing companies and resell it on an open market. Even then, I think the operational challenges of keeping all of that up, running, and utilized as close to 100% as possible, without losing focus on your core business problem, will be difficult, because you won't be able to compete with engineering companies whose core competency is exactly that space.
> The part that's always missing from these rent-vs-buy analyses on HN, for some reason, is that they totally ignore the opex cost of operating your own hardware, which is going to be non-zero.
Effectively hiring, retaining, evaluating and rewarding competent staff is hard. Even at a big company the datacenter can be a really small world, which makes it hard for your best employees to grow. Things are especially hard when you don't have a tech brand to rely on for your recruiting, and the staff's expertise is far outside the company's core business, making it harder to evaluate who's good at anything.
> Re data, I think egress rates are going to start disappearing over the next few years.
I'm not sure why you think that. AWS hasn't budged on their egress pricing for a decade (except the recent free tier expansion), despite the underlying costs dropping dramatically. GCP and Azure have similar prices.
Fact is, egress pricing is a moat. Cloud providers want to incentivize bringing data in (ingress is always free) and incentivize using it (intra-DC networking is free), but disincentivize bringing it out. If your data is stuck in AWS, that means your computation is stuck in AWS too.
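For a sense of scale, here's a sketch of what pulling a dataset back out costs at tiered per-GB rates in the ballpark of the big clouds' published internet-egress pricing (illustrative numbers only; they change and vary by region):

```python
# Rough egress bill for pulling data out of a big-cloud object store,
# using tiered per-GB rates roughly in line with published AWS internet
# egress pricing (~$0.09/GB for the first 10TB, cheaper at higher tiers).
# Rates are illustrative assumptions, not a quote.

TIERS = [                      # (tier size in GB, $/GB)
    (10_240, 0.09),
    (40_960, 0.085),
    (102_400, 0.07),
    (float("inf"), 0.05),
]

def egress_cost(gb: float) -> float:
    cost, remaining = 0.0, gb
    for size, rate in TIERS:
        chunk = min(remaining, size)
        cost += chunk * rate
        remaining -= chunk
        if remaining <= 0:
            break
    return cost

for tb in (1, 10, 100, 500):
    print(f"{tb:>4} TB out: ~${egress_cost(tb * 1024):,.0f}")
```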
Disclosure: I work at Cloudflare on R2, so I'm a bit biased on this.
I think we're going to put real pressure on traditional object storage rates to come down, since Cloudflare's entire MO is running the network with zero egress. As we expand our cloud platform, it seems inevitable that you will at least have a strong zero-egress choice, and if we do a good job, Amazon et al. will inevitably be forced to get rid of egress. Matthew Prince laid out a strong case for why either scenario is good for us in a recent investor day presentation (either we cannibalize S3's business and R2 becomes a massive profit machine for us because they refuse to budge on egress, or Amazon drops egress, which is an even larger opportunity for us).
Products like Cache Reserve help you migrate your data out of AWS transparently from any service (not just S3) - you just pay the egress penalty once per file.
Anyway. I’m not saying it’s going to disappear tomorrow but I find it hard to believe it’ll last another ten years.
> the opex cost of operating your own hardware, which is going to be non-zero
Early in my career I worked at a company with a DC on-site. I remember the months-long project to spec, purchase and migrate to a new, more powerful DB server. How much that cost in people-hours, I have no idea. I upgraded to a better DB a couple months ago by clicking a button...
Don't even get me started on ordering more SANs when we ran out of storage, or the time a hurricane was coming and we had to prepare to fail over to another DC.
Cloudflare Bandwidth Alliance and R2. S3 felt some pressure just from our pre-launch announcement. It'll be interesting to see how they adjust over the next couple of years.
They're actually probably paying more than the average listed prices, as mainframe users (a mainframe being basically an on-premise PaaS, where IBM rents you a high-performance, distributed and redundant hardware cluster on a pay-as-you-use basis) are often dependent on very high reliability, high uptime and low latency.
The ability to scale up experiments is really nice in the cloud. In my experience you need to be quite large before you're using your own GPUs at a utilization percentage that saves money while still having capacity for large one-off experiments.
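A toy break-even sketch of that point, where every input (purchase price, opex, on-demand rate, amortization window) is an assumption for illustration:

```python
# Toy break-even: at what utilization does owning a GPU box beat renting
# on demand? All inputs below are assumptions for illustration only.

CAPEX = 150_000            # purchase price of an 8x A100 server (assumption)
LIFETIME_MONTHS = 36       # amortization window (assumption)
OPEX_PER_MONTH = 3_000     # power, colo, ops time (assumption)
CLOUD_PER_HOUR = 22.0      # on-demand 8-GPU instance (assumption)
HOURS_PER_MONTH = 720

own_per_month = CAPEX / LIFETIME_MONTHS + OPEX_PER_MONTH
breakeven_hours = own_per_month / CLOUD_PER_HOUR
print(f"Owning costs ~${own_per_month:,.0f}/month amortized")
print(f"Break-even at ~{breakeven_hours:,.0f} rented hours/month "
      f"(~{breakeven_hours / HOURS_PER_MONTH:.0%} utilization)")
```

Below that utilization (and without the occasional need for a big one-off burst), renting wins; above it, the owned box starts paying for itself.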