As DigitalOcean's CTO, I'm very sorry for this situation and how it was handled. The account is now fully restored and we are investigating the incident. We are planning to post a public postmortem to provide full transparency for our customers and the community.
This situation occurred due to false positives triggered by our internal fraud and abuse systems. While these situations are rare, they do happen, and we make every effort to get customers back online as quickly as possible. In this particular scenario, we were slow to respond and had missteps in handling the false positive. This led to the user being locked out for an extended period of time. We apologize for our mistake and will share more details in our public postmortem.
Thanks for the replies.
Let me try to address a few of the things I have seen here. We haven't completed our investigation yet, which will include details on the timeline, decisions made by our systems and our people, and our plans to address where we fell short. That said, I want to provide some information now rather than waiting for our full post-mortem analysis. A combination of factors, not just the usage patterns, led to the initial flag. We recognize and embrace our customers' ability to spin up highly variable workloads, which would normally not lead to any issues. Clearly we messed up in this case.
Additionally, the steps taken in our response to the false positive did not follow our typical process. As part of our investigation, we are looking into our process and how we responded so we can improve upon this moving forward.
With all due respect, I think you’ve missed the point. The larger point from my perspective is that you denied your client the ability to move their data off your platform. This would be akin to someone breaking the terms of their lease and you confiscating all their belongings with the intent of burning them. You should provide some sort of grace period for users to move their data off your platform. For everyone else reading this, it should be a wake-up call as to why you should never entrust your data to a single entity. Even if they have 99.9999% uptime, you never know when they’ll decide to deny you access to your data.
Thank you for jumping in personally to clarify what happened.
As a business owner with much of our infrastructure depending on DigitalOcean, the incident is concerning. It affects the reputation of DO as well as its customers.
The demographics on Twitter, and especially here on HN, represent a sizable crowd with decision-making influence on DO's bottom line. I hope to see some effort being made to prevent situations like this in the future, and to regain trust.
As a (so far) satisfied customer, it's great to hear that:
> A combination of factors, not just the usage patterns, led to the initial flag.
> We recognize and embrace our customers ability to spin up highly variable workloads, which would normally not lead to any issues.
> we are looking into our process and how we responded so we can improve upon this
I’ll be awaiting the post-mortem and, depending on it and the procedures proposed to stop this from happening again, will hold off on moving everything I have off DO.
The real “mess up” here was the bit where you blocked the account with no reason given and no further communication - other than the one-liner your intern wrote for the email.
I’m expecting you to sit down with your legal team and rewrite your TOS to be more customer-focused and less robotic.
I wanted to provide you all with an update on the postmortem I promised on Friday. Our analysis has been completed. We will be sharing the full document soon and will publish a link in this thread for those wanting to read it. We promised Raisup a first look and provided the draft document to them this afternoon. Because some information in the document could be considered sensitive, we wanted to give Raisup a chance to review it before sharing it with the public.
As a long-term customer, here is a small suggestion to make this fail-safe: by default, trust your customers, and ask them first instead of shooting first.
Considering you have been marketing yourself as the platform for the developer-oriented cloud, you should be aware that surge provisioning can and will happen.
What do you recommend your clients do if that kind of mistake happens to them? Is Twitter-shaming the only way out?
I know people cite legal arguments for why they shut you down and won't say anything, but this is the worst possible scenario. I'd rather be accused of something I didn't do than get a "oops, we can't tell you anything, your account has been shut down."
This is important. I hate how it has become standard for companies to screw their customers unless they are online-shamed.
The response email even read like a giant polite FUCK YOU (we locked your account, no further action required by you).
You bet I will have further action!
And it is after the shaming that you get an "I am sorry for this situation". Which sounds more like saying "I'm sorry we got caught".
My frustration is not with DO specifically, as they do exactly what every other company does.
But, what of the other thousands of people that got screwed and did not put it on twitter?
It is the equivalent of getting screwed in a restaurant: it is the loudest complainer who gets the reward, while all the others silently swallow the injustice.
It's most likely due to the fact that the people who can act upon the process itself, not just follow the process, inevitably see the issue and genuinely want to help.
Getting your message into the right hands is what matters, not the platform it's on.
Mistakes happen, and algorithms are sometimes a necessary part of scale/efficiency. Everyone understands that.
That said, what's highly troubling as a DO customer (and someone who is planning to deploy startup infrastructure of my own with DO) is:
1) The discrepancy between this customer's experience and clear assurances made on this very forum by high-level DO employees that:
a. warnings are ALWAYS issued before suspensions.
b. even in the event of a suspension, services remain accessible (though dashboard access and/or the ability to spin up NEW services may be impacted), i.e., the affected customer could still retrieve data or SSH into droplets.
2) The relatively trivial nature of the customer's offending usage (temporarily spinning up 10 droplets). What happens if, for example, a startup gets a press mention somewhere that leads to a massive traffic spike, necessitating a sudden and significant spin-up of new droplets (especially if this is done programmatically versus by hand in the dashboard)?
3) The apparent lack of consideration of the customer's history, or investigation into their usage. It seems the threshold for suspending services of longstanding customers who are verifiably engaging in commerce (taking a moment to look at their website and general online presence for indicators of legitimacy), should be SUBSTANTIALLY higher than for, say, an account who signed up a week ago. Context matters.
I'm no longer able to edit the above comment, so to elaborate on #1:
Following is a comment[1] by Moisey Uretsky in another thread[2]:
> Depending on which items are flagged the account is put into a locked state, which means that access is limited. However, the droplets for that account and other services are not affected at all. The account is also notified about the action and a dialogue is opened, to determine what the situation is. There is no sudden loss of service. There is no loss of service without communication. If after multiple rounds of communication it is determined that the account is fraudulent, even then there is no loss of service that isn't communicated well in advance of the situation.
What he said in another thread, and in this thread, is press-release marketing. Don't trust what he says to save his business. You have absolutely no reason to.
I prefer to give them the benefit of the doubt, though a clear explanation of why the above policy was not followed seems warranted. (It also doesn't appear to have been followed in several other instances reported by other former customers in various HN threads.)
If DO reserves the right to cut off services and access to your own data permanently and without warning (outside of a court order or confirmed illegal activity), that needs to be unequivocally stated, and the triggering factors should be made known. Otherwise, DO is not fit for production systems.
Additionally, it would be nice to see the creation of a transparent, high-level appeal process for customers affected by suspensions. Truly malicious customers wouldn't use it (what would they hope to successfully argue to an actual human reviewer?), but it would greatly benefit legitimate customers to have an outlet other than social media by which to "get something done" in the event of an inappropriate suspension followed by a breakdown in the standard review process.
Not sure why you’re being downvoted. Point 2 is very relevant. Scaling instances due to sudden peaks should be totally safe. Even when automated. Guess AWS is still lonely at the top.
It really is a trivial amount of resources to have triggered such a reaction. It's almost like DigitalOcean doesn't like being in the cloud hosting business. One of the fundamental, desirable points of the shift to cloud hosting services is that you can quickly spin up a bunch of resources when needed and then dump them.
You've got an additional problem though, which is that this tells us you have two support channels: one that doesn't work (i.e. yours, the one you built), and one that does (Twitter-shaming). The first channel represents how you act when no one's watching; the second, how you act when they are. Most people prefer to deal with people for whom those two are the same.
Do not use DO. The very fact that their default response to suspected spam is to cause prod downtime is so bizarre and unacceptable that it does not make any sense whatsoever for a business to rely on them.
You do not need to scrub or write anything to not provide user A’s data to user B in a multi-tenant environment. Sparse allocation can easily return nulls to a reader even while the underlying block storage still contains the old data.
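The sparse-allocation point can be sketched with a tiny local experiment (a minimal illustration of the general filesystem mechanism, not DO's actual block-storage code): reading from a hole in a sparse file returns zero bytes even though no data was ever written, or scrubbed, for that range.

```python
import os
import tempfile

# Minimal sketch: an unwritten "hole" in a sparse file reads back as zeros.
# This only illustrates the general mechanism; it is not DO's storage layer.
fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "wb") as f:
        f.seek(1024 * 1024)  # leave a 1 MiB hole with no blocks written
        f.write(b"end")

    with open(path, "rb") as f:
        chunk = f.read(4096)  # read from inside the hole

    # The reader sees nulls; nothing had to be scrubbed to make that happen.
    assert chunk == b"\x00" * 4096
finally:
    os.remove(path)
```

A block-storage layer can do the analogous thing: map a new tenant's unwritten extents to "read as zero" instead of handing back whatever bytes the previous tenant left behind.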
They were just incompetent.
On top of all of that, when I pointed out that what they were doing was absolute amateur hour clownshoes, they oscillated between telling me it was a design decision working as intended (and that it was fine for me to publicize it), and that I was an irresponsible discloser by sharing a vulnerability.
Then they made a blog post lying about how they hadn’t leaked data when they had.
I think it says a lot that this CTO joker flew in, regurgitated the standard-issue "we will endeavor to do better" apology and left without answering any of the very legitimate follow-up questions. I would never deal with an organisation that behaves like these guys.
That’d be unrealistic for any company to claim, and if any company I worked with did claim that I would run for the hills.
That’s akin to saying “we’ll never ship a bug”, or “we have an SLO of 100%”. That’s impossible for anyone to claim. Same goes for the response handling. There is clearly a lot of room for improvement there, but if you’re insisting on not getting canned responses, that means a human needs to be involved at some point. Humans will at times be slow to respond. Humans will at times make mistakes. This is just an unavoidable reality.
I get that mob mentality is strong when shit hits the fan publicly, but have a bit of empathy and think about what reasonable solutions you may come up with if you were to be in their situation, rather than asking for a “magic bullet”.
I could see a good response here being an overhaul of their incident response policy, especially in terms of L1 support. Probably by beefing up the L2 staffing, and escalating issues more often and more quickly. L2 support is generally product engineers rather than dedicated support staff/contractors, so it’s more expensive to do for sure, but having engineers closer to the “front line” in responding to issues closes the loop better for integrating fixes into the product, and identifying erroneous behavior more quickly.
Sure, a lot of others and I react rather strongly in these situations. I agree with that, but you already seem to understand the reasons.
However, can you say with a straight face that the very generic message left here by DO's CTO instills confidence in you about how they will handle such situations in the future?
Techies hate lawyer/corporate weasel talk. The least that person could do was speak plainly without promising the sky and the moon.
I would prefer a generic message and a promise for follow up once all the facts are known over a rushed response that may be incorrect.
I’m an engineering manager on an infrastructure team (not at all affiliated with DigitalOcean, though full disclosure, I do have one droplet for my personal website). I know how postmortems generally work, and it’s messy enough to track down a root cause even when it’s not some complex algorithm like fraud detection going off the rails.
I’d rather get slow information than misinformation, but I understand the frustration in not being able to see the inner working of how an incident is being handled.
And I agree with your premise. However, my experience has been that postmortems are often watered-down, evasive PR talk.
If you look at this through the eyes of a potential startup CTO, wouldn't you be worried about the lack of transparency?
And finally, why is such an abrupt account lockdown even on the table at all? You can't claim you are doing your best when it's very obvious that you are just leaving your customers at the mercy of very crude algorithms, and those, let's be clear, could have been built so that no account is ever locked without a human's approval at the final step.
What I'm saying is that even at this early stage when we know almost nothing, it's evident that this CTO here is not being sincere. It seems DO just wants to maximally automate away support, to their customers' detriment.
Whatever the postmortem ends up being it still won't change the above.
Our line so far has been to change service providers if we start getting copy-paste answers from support. We always make sure we can get hold of a human on the phone even without a big uptime contract. This has so far led us to small companies that are not overrun by free accounts used for spam or SEO. That means they have no need for automatic shutdown of accounts, and instead you get a phone call if something goes wrong.
This is how I would go about it as well. But I imagine that's a big expense for non-small companies, not only in money but in the time of valuable professionals who could have spent it improving the bottom line.
I too value less known providers. The human factor in support is priceless.
7 hours, on a Friday night in the headquarters time zone. This issue is resolved and is clearly not widespread, so does getting a response on Monday or Tuesday vs. right now make any difference?
Companies are made of people. Let the people have a life. Their night is shitty enough as is after this, I guarantee you.
The thing is, my business doesn't want to deal with people. It wants to deal with a business made of multiple people, to guarantee service availability. If he cannot answer, surely someone else at DigitalOcean can?
You are being unreasonable here. He promised a postmortem. I’d much rather wait a few days to get a clearly written, comprehensive analysis of the problems than to get an immediate stream of confusing and contradictory raw data.
If you have ever been involved in post facto analysis of a process breakdown like this you know how hard it is to get the full picture immediately. Rushing something out does no one any favors.
Sure, but the email he received basically said "your account is locked. No other info. Thank You". That to me is a much scarier thing than anything else in the thread. How can anyone trust in your infrastructure if your standard protocol is literally just shutting down their entire operation without any form of review or communication?
We have a relatively large spend ($5k+) at DO for a unique client (most of our other clients can be served by our colocated facility), and I'm going to second this, with DO or any other provider. They should always explain exactly which rule was broken. If the customer is legit and genuine, they will promptly fix the issue and won't be a further problem. Being vague makes it super troublesome to rely on any service that takes that tactic (like Google, for example). If they continue to re-offend and find other ways to skirt the rules, that's when you move on to account termination.
You can't, obviously. Even though I've used them before I really doubt I'll ever use DigitalOcean again. I can almost understand terminating customers (with notice) via automated heuristics for suspicious behavior, especially on the low end of the hosting market, but locking out a legitimate paying customer from backups with no notice or recourse is terrifying.
"In this particular scenario, we were slow to respond and had missteps in handling the false positive. This led the user to be locked out for an extended period of time."
This didn't seem like a case of being "too slow" - the customer in question went through your review process (which was slow, yes), and the only response he got was "We have decided not to reactivate your account, have a nice day".
That just seems like a lack of interest in supporting your customers that are falsely flagged.
Last week ended on a real low note for many of us at DO. We took a perfectly good customer and gave them an experience no one should have to go through (all while he was trying to leave on vacation no less). We can and must do better. To do better we need to learn from our mistakes. To that end, we also think sharing the information about this incident openly is the best way to help all our customers understand what happened and what we are doing to prevent it in the future.
Yesterday we completed our postmortem analysis of the incident involving Nicolas (@w3Nicolas) and his company Raisup (@raisupcom). With their permission we are sharing the full report on our blog here:
No offense, as I'm sure this has been hard, but a screwup like this publicly demonstrates DO is not ready for prime time competition against AWS, Azure, GCP and the like.
I'd gladly do whatever it takes to KYC, send you my business license, tax returns, EIN, invoice billing, etc so you know there is someone behind my account.
We spend thousands of hours eliminating single points of failure. If an automated system can undermine that work, DO is not an option for us to host anymore.
A year of data backups lost. Do you realize how that alone may cause clients to dump a company, and do you realize that startups may never recover from a fiasco like this? I understand that it was a false positive triggered by internal systems. But how do you explain the delay in restoring the services, and the re-flagging within hours after the services were restored?
Hope you can share what you learnt from this incident and hopefully you'll take a hard look at your processes.
I'd hate to be caught in the same issue, especially that we are already customers, and I'm not sure I'll have as much clout as Nicolas here to get your attention.
> and I'm not sure I'll have as much clout as Nicolas here to get your attention.
It's occurring to me now that while I've successfully ignored twitter for years, I should probably rectify that just so I have somewhere to type my hopes and prayers when this eventually happens to me, and hope for a miracle. It sure seems like the only place they're listened to.
> I'm not sure I'll have as much clout as Nicolas here to get your attention.
Maybe keeping a Twitter (and other social media) account with at least a certain number of followers should be considered part of a company's security strategy? You'd also need to post something interesting periodically to keep your followers, so that you have their attention when you need it.
IANAL, but DO's ToS is loaded with weasel words.[0] So if a customer can sue in some jurisdiction where the binding arbitration and liability limitations don't apply, maybe they could at least get a fair settlement.
It would probably be worth it to restore trust: Refund all the money they've taken from this company for the last year, and apply a credit to their account for 3x that amount, say.
It's not the false positive that is the issue here. The issue is that a. it took way too long to get the business back up and running, and b. the second response gave no explanation and no recourse for the business to become operational again.
The very fact that this can happen from an automated script with no oversight should give every one of your customers pause as to whether they continue with your service.
I'd say the issue is that DO is shutting down servers for any reason at all (legal issues aside). If DO sells a product with a particular capacity, why should they intervene at all if a user is using all of the capacity they're paying for?
So unless a person is popular enough to get enough people talking about it on twitter or hacker news, someone whose account is flagged by your bad script is going to lose his business.
Are you aware that Viasat has blacklisted a huge number of DigitalOcean /24 subnets? I can't access many of my servers when I'm on a satellite connection, in addition to other websites hosted on DigitalOcean. I've talked with the Viasat NOC and they told me they were blocking DigitalOcean subnets due to malware.
This is probably worth its own post; it would be very interesting to see more detail. I'm also fairly certain this is not exclusive to DO.
Should we be concerned about our 40+ droplets with DO now? We built our business on DO; we really could go bankrupt, along with our 30+ clients, if anything like this happened to us. Please change your support system ASAP, otherwise we will be switching to another platform. We are expecting a very serious response from you.
Do not use DO. The very fact that their default automated response to spam is prod downtime is unacceptable.
For this decision-making process ever to have been put in place, so many people across the company must have misunderstood the service being provided that there is no reasonable expectation of safety or trust from DO at this point.
Every cloud company has anti-abuse systems that will limit your access to their APIs or take down your machines if abuse is suspected, for example if it looks like you're mining bitcoin. Your prod isn't any different from your staging to them.
Clearly you should be doing regular backups of everything, and not on DO. And make sure to test your backups. And make sure you have a fast migration plan into another cloud.
Ideally you should be cloud-agnostic, but that's quite hard to achieve.
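The backup advice above can be sketched as a toy verification loop (a hedged illustration: local temp directories stand in for the primary provider and the off-site target, which in practice would be a different cloud entirely):

```python
import filecmp
import shutil
import tempfile
from pathlib import Path

# Sketch of "mirror off-provider, then actually test the restore".
# Local directories are placeholders for the primary cloud and the
# off-site target; the principle is that an untested backup is not a backup.
primary = Path(tempfile.mkdtemp(prefix="primary_"))
offsite = Path(tempfile.mkdtemp(prefix="offsite_")) / "mirror"

(primary / "db.dump").write_text("customer data")

# 1. Copy the data somewhere the primary provider cannot lock you out of.
shutil.copytree(primary, offsite)

# 2. Verify the mirror matches the source before trusting it.
cmp = filecmp.dircmp(primary, offsite)
assert not cmp.diff_files and not cmp.left_only and not cmp.right_only

shutil.rmtree(primary)
shutil.rmtree(offsite.parent)
```

The same check (copy, then byte-compare or checksum-compare) applies whatever transport you use; the key is running the verification step on a schedule, not just the copy.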
That’s all well and good. But how do you plan to reimburse your customer for this gross negligence? I have never heard of such incompetence or lack of communications from anyone on AWS’s business support plan. Why should anyone trust DO over AWS or Azure?
If your DO (or other cloud provider) credentials are compromised, it's usually a matter of seconds before someone fires up the largest possible number of instances to start crypto mining.
Do you realize that by abusing this thread to make a single PR-focused comment with no intention of participating in the conversation, you've disrespected the community here and the few remaining DO customers within it?
>> and we take every effort to get customers back online as quickly as possible. In this particular scenario, we were slow to respond and had missteps in handling the false positive.
You clearly don't make every effort, and did not -- so why waste the extra verbiage and switch from active to passive voice?
Based on your cliche response I have zero confidence that DO will do anything substantial to address the root causes of the issue.
I've found DO's public posts to be particularly grating in the "we are listening to YOU, our customer. we take feedback extremely seriously" department.