Almost everyone has fully recovered at this point. If you are still seeing problems with your Virtual Machine after the incident earlier this week, we want to help you!! Please send mail to azcommsm@microsoft.com and email me directly at corey.sanders@microsoft.com.
Please send with high importance so it pops in our inbox and we will dig in.
First, I am really sorry about the impact the incident had on your service. Regarding your changing IP: are you currently using the reserved IP feature? You can reserve both your external IP and your internal IP.
Hey nnx, this is Corey from the Azure engineering team. We have a standard protocol in the team of applying production changes in incremental batches. Due to an operational error, this update was made across most regions in a short period of time. I really apologize for the disruption.
So, you had a bug in your code. That happens to everyone and I think we all understand. However, there are a number of other issues here which seem systemic and much more troubling. First, that your "flighting" did not catch the problem. Why was that? If the bug caused an infinite loop on all the live storage systems, that seems like it should have been fairly obvious on the customer systems you tested on. Second, that the patch was rolled out to all servers at the same time. You have admitted this was a mistake, but honestly it looks like amateur hour. If you are running business critical distributed cloud infrastructure, you just don't ever do this. Third, that there was extended fallout from rolling the patch back. If there are still customers experiencing downtime from this problem a full day later, that speaks to some serious flaws in the ops architecture and process. If you guys want to compete with AWS and similar platforms, it seems like you have a long way to go still. This set of mistakes should haunt you for a long time, because it's going to come up whenever someone is trying to convince their boss/colleague/team that Azure is a solid solution.
The last two times there was a big issue, the same thing happened with the status dashboard (it became inaccessible). I remember the same issue when the certs expired 1.5 years ago. I really like Microsoft and was convinced "you" would somehow isolate the dashboard and host it separately, but it turns out I was wrong. Do you happen to know the reasons for hosting the status dashboard inside of Azure? It seems so counter-intuitive to me. Or is it actually hosted externally but died due to the load when the issue started to appear?
The OP mentions that Microsoft representatives gave info via public forums. When the issue appeared I looked in different places trying to find info, but all I found was a statement saying "We are aware of issues." I looked at the Azure Twitter/blog, ScottGu's Twitter/blog, Hanselman's, and the MSDN forums. I also tried this forum and reddit. Do you know where I should have gone to receive details?
Thanks. The communications and the service health dashboard are two areas where we are creating improvement plans based on the learnings from this event. For the dashboard, we do expect it to continue to run even through outages like this one, but we encountered an issue with our fallback mechanism that we need to understand more deeply.
For general communications, we did most of our early communication on the event via Twitter, announcing the incident and giving updates. We need to build a more formal multi-pronged approach to communicating, including faster responses in the MSDN forums and here on HN, to make sure we are reaching as many of our customers and partners as possible. Thanks again for the feedback!!
Hi, this is Corey Sanders, an engineer on the Azure compute team. Yes, our normal policy for updates is to roll them in incremental batches. In this case, due to an operational error, we did not apply the changes as per normal policy.
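For readers unfamiliar with the batched-rollout pattern being described, here is a minimal sketch of the idea in Python. None of this reflects Azure's actual tooling; the function and parameter names (`rollout_in_batches`, `batch_fraction`, etc.) are hypothetical. The point is that each batch is health-checked before the next one starts, so a bad update stops after touching only a small fraction of the fleet:

```python
def rollout_in_batches(servers, apply_update, healthy, batch_fraction=0.05):
    """Apply an update to servers in small incremental batches ("flights").

    Halts as soon as a batch fails its post-update health check,
    limiting the blast radius of a bad change. Returns the list of
    servers actually updated and whether the rollout completed.
    All names here are illustrative, not real Azure tooling.
    """
    batch_size = max(1, int(len(servers) * batch_fraction))
    updated = []
    for start in range(0, len(servers), batch_size):
        batch = servers[start:start + batch_size]
        for server in batch:
            apply_update(server)
        updated.extend(batch)
        # Bake step: verify this batch before touching the next one.
        if not all(healthy(s) for s in batch):
            return updated, False  # stop the rollout; begin rollback
    return updated, True
```

The operational error described above is equivalent to skipping the per-batch health gate: if the check never runs (or the batch size is effectively the whole fleet), a faulty update reaches every server before anyone notices.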