As bubblethink already mentioned, the article is arguing that batch processing jobs (e.g. those characteristic of Deep Learning) don't benefit substantially from cloud infrastructure.
High availability, edge processing, fast network upload, and the Lego blocks for building redundancy are all moot for these workloads.
Most training jobs recover automatically if a node goes down (e.g., by resuming from the latest checkpoint), which covers essentially all of the failure handling required.
Hi, Lambda engineer here (I'm one of the authors). You bring up some good points; I'd like to address some of them:
Admin:
> Time (and cost) it takes to install software, drivers, etc.
These tasks aren't made unnecessary by the cloud. Yes, with cloud, once you've done the work it can be encoded into an image or container forever. However, the same applies on-prem with container solutions like Docker.
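For instance, here's a minimal sketch of baking that one-time setup into an image (the base-image tag, framework, and version below are illustrative, not a specific recommendation):

```dockerfile
# The base image ships with the CUDA toolkit and cuDNN preinstalled.
FROM nvidia/cuda:10.0-cudnn7-runtime-ubuntu18.04

# Do the library setup work once; every container started from
# this image inherits it.
RUN apt-get update && \
    apt-get install -y --no-install-recommends python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*
RUN pip3 install tensorflow-gpu==1.13.1
```

Build it once with `docker build`, then any box with the NVIDIA container runtime can run it (e.g. `docker run --runtime=nvidia ...`), whether that box is in the cloud or in your office.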
Maintenance / Capitalization / Finances:
> 3-year? What is the useful life of this? When does it seemingly become obsolete?
With GPUs of recent history, obsolescence is not a concern in a 3-year time frame. On AWS, people are still using K80s (released in 2014!). The GTX 1080 Ti, which was released 2.5 years ago, is selling for substantially above MSRP. This may change if competition in the GPU space increases and NVIDIA loses its monopoly.
> AWS will continually upgrade their hardware and you keep paying the same.
True, but this concern is mitigated by the slow rate of GPU obsolescence.
> Hardware can be capitalized, which means you can push it to the balance sheet (for tax or valuation purposes)
Can you say a little more about this? Not sure what you're getting at.
> Spending $90k instead of $184k in year 1 with the option to turn it off if you want (no longer need). This could be very valuable for a startup that wants elastic spending patterns.
This needs to be evaluated on a case-by-case basis. A poorly capitalized startup with an unpredictable compute workload is not a good candidate for buying on-prem. A well-capitalized startup that consistently uses GPUs is a better candidate.
> Returns, breakage, warranty in case of a hardware failure
Any reasonable hardware provider will include an option for a 3-year warranty.
>Hardware can be capitalized, which means you can push it to the balance sheet (for tax or valuation purposes)
> Can you say a little more about this? Not sure what you're getting at.
I think he means this: the hardware goes on your balance sheet, then it depreciates by a certain amount over those 3 years, and that loss may be tax-deductible.
In other words: from your bank account's perspective it looks like an upfront cost, but because you could sell the servers at any time, on your books they look more like a rental, with capital slowly draining away each year as the hardware loses value.
Correct. Depending on what accounting principles you use, this is typically 3-5 years. It's akin to an airline buying a Boeing plane. Say it costs them $1B; it'll actually hit their income over 35 years ($1B / 35), which means only ~$28M shows up on their income statement per year (simplified example). Most companies are valued and taxed on their income, so this is important to understand.
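To put rough numbers on that with the figures from upthread (a simplified straight-line sketch: zero salvage value assumed, and assuming the $184k figure is the hardware purchase price; real tax treatment varies):

$$
\text{annual depreciation expense} = \frac{\text{cost} - \text{salvage}}{\text{useful life}} = \frac{\$184\text{k} - \$0}{3\ \text{years}} \approx \$61\text{k/year}
$$

So the income statement shows ~$61k/year for three years rather than a $184k hit in year 1, even though the cash left the bank up front.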
In fact, for most enterprise-grade hardware I've ever purchased, the three-year warranty tends to be pretty much built into the cost. Usually where I pay extra is to add the fourth and fifth years.
Lambda Labs engineer here. Here's what we're trying to argue:
Many benefits of running infrastructure in the cloud are lost for offline batch processing jobs. Training machine learning models doesn't require low response times, high availability, geographic proximity to clients, etc. Yet, with cloud, you're paying for all this extra infrastructure.
The main benefits of cloud for machine-learning-style workloads are cost savings (when utilization is low) and not having to do your own electrical set-up.
On the other hand, cloud is extremely expensive for groups that require high base levels of GPU compute. The article is arguing that such groups can save a huge amount of money by moving infrastructure on-prem.
I feel like I'm talking to a brick wall. This happens often and it exhausts me. You are still arguing cloud vs. colo / on-prem (even ignoring that on-prem and colo are very different, but w/e), and what I am saying is that there is a third option that neither the article nor your reply even acknowledges.
We wrote Debian packages for every framework, including CUDA and cuDNN. Using our Debian repository, you can install all of these frameworks with apt/aptitude.
When a new version of a framework comes out, we usually have it available in 1-2 weeks.
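Roughly, that means installs look like this (the repository URL and package names below are placeholders, not necessarily the published ones):

```sh
# Placeholder repo URL and package names -- check the actual Debian
# repository for the real entries.
echo "deb https://example.com/lambda bionic main" | \
    sudo tee /etc/apt/sources.list.d/lambda.list
sudo apt-get update
sudo apt-get install tensorflow-gpu caffe torch  # CUDA/cuDNN come in as dependencies
```

Once the repo is configured, framework upgrades become a plain `apt-get upgrade` rather than a rebuild.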
Any chance you can get MXNet (and Keras-MXNet) in there?
Installing this stuff is a huge nuisance for us and we have some pretty insane Dockerfiles to handle all the different combinations. I might look into using this for our ML images.
Our company used to run a cluster of ~1,000 GPU servers for inference and training. We used Caffe and Torch. Provisioning and software maintenance was a huge hassle.
We realized how painful it is to get a machine set up for deep learning, so we decided to release our Debian packages to the public.
We sell computers built for deep learning researchers and figure the more people we can help get started, the better :)
The author wasn't joking about the noise levels. This machine sounds like an F1 race car.
If you don't require a rack-mounted server, a cluster of workstations like NVIDIA's DIGITS DevBox is far more cost-efficient (and less noisy). I run a compute-intensive business (Dreamscopeapp.com), and we opted to build a cluster of desktop-like machines instead of using a rack-mounted solution. Another benefit is that you don't run into the power issues mentioned in the post.
So - tried your quote form. The options are 4x 1080 Ti, 4x Titan Xp, or 8x P100 -- but no 8x 1080 Ti? Or is the quote form wrong?
$16.5k seems pretty reasonable for 8x 1080 Ti with a bit of profit for building it, but unreasonable for only 4x 1080 Ti. My home-built 4x 1080 Ti box (without quite enough PCIe bandwidth, admittedly) is under $6k. I'm assuming/hoping there's an error there. :)
Oh, also - if I want a quote on both the big server and the little workstation I have to enter my contact info twice? Not particularly customer-friendly.
For a quad-GPU config you should look at the dev-box-type option. It's $8,750 for a machine with 4x 1080 Ti, 64GB of RAM, and a 1TB SATA SSD. Quite a steep margin if you ask me, considering a 128GB RAM machine you build yourself would cost at most $5,700 (taxes included) if you get everything from Amazon, and probably under $5k if you're willing to shop around a little.
You can save a ton of money by building your own machine.
The server we sell is packaged with software we wrote that makes administration significantly easier. We also provide technical support and even a limited amount of free machine-learning consulting. The customers who purchase this server want a headache-free solution and aren't as price-sensitive as a lone researcher.
Notice the custom-parts box accounts for two more GPUs; I'm not sure why the site doesn't let you add 4 in the GPU section.
This setup ranges from $5,250 with 4 GPUs down to $3,240 with 1 GPU. You might want to bump up the PSU for 4 GPUs; it's currently 1500 watts, which may or may not be enough at max load. The article shows a max of ~2800 watts with 8 GPUs.
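Rough napkin math, assuming the 250 W board power of a 1080 Ti and ~300 W for CPU, drives, and fans (both assumptions, not measured numbers):

$$
4 \times 250\,\text{W} + \sim 300\,\text{W} \approx 1300\,\text{W}
$$

That leaves only ~200 W of headroom on a 1500 W unit before accounting for transient spikes, which is why bumping up the PSU is the safer call.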
Nice. BTW, the Rosewill Tokamak 1500 from Newegg is a way to save a few bucks on that build, though it's out of stock (they just had it on sale). It's also 80 Plus Titanium.
I purchased a couple of the machines from these guys for my team and had a great experience. Highly recommended if you have better ways to spend your time than building them yourself.
I ask this out of deep curiosity and by no means with any intent to offend: how does (will?) dreamscopeapp.com make any money? (I don't have an Android device or I'd install it – I'm asking because it's entirely unapparent from your web presence.)
Dreamscope doesn't actually make money :) It's a little under break-even. It brings in revenue through a $9.99/mo premium subscription, which gives customers higher-resolution images.
Our neighborhood uses Nextdoor (about 950 homes) and it's fantastic for community building. It's focused more on personal interaction than hyperlocal news aggregation, but in my opinion that makes it more useful.
Neighbors in our community who have lived on the same block for 20 years without speaking are now getting together for block parties and sharing info about babysitters and local crimes. It's actually very useful.
That's correct. The goal of this site is to aggregate all the free courses available online in one place. It has courses from MIT, Harvard, UC Berkeley, Yale, the University of Houston, etc.