Thank you for using FMX to manage your operations. We understand that FMX is often a critical component of running your organization, and we take any service disruption seriously. This postmortem report explains what caused the interruption in service and how we plan to avoid issues like this in the future.
Root cause of outage:
On 8-26, we discovered a high-priority bug in our application and determined that it would be disruptive enough to our users to warrant a mid-week fix. On the evening of 8-26, we attempted to deploy this fix twice and were unsuccessful both times. We later determined that the deployment included a problematic change. During the second deployment attempt, a rare sequence of events occurred: Windows Update restarted the build server mid-deployment, and Azure, our hosting service, continued the deployment in the background without our knowledge.
As designed, Azure initially refused to deploy the broken application to the live servers because its health probes were failing (a protective measure to prevent deploying bad code). However, after an hour of health probe failures with no further cancellation attempts from us, Azure, also by design, deployed the broken application.
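For readers who want a more concrete picture, the hold-then-deploy behavior described above can be modeled roughly as follows. This is a simplified Python sketch of the decision logic only, not Azure's actual implementation or API. The names `GateDecision` and `decide`, and the 60-minute default (taken from the hour of probe failures mentioned above), are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class GateDecision:
    action: str  # "hold", "promote", or "cancelled"
    reason: str


def decide(probe_healthy: bool,
           minutes_elapsed: int,
           cancel_requested: bool,
           timeout_minutes: int = 60) -> GateDecision:
    """Hypothetical model of a health-probe deployment gate.

    Assumption: the gate holds a deployment while probes fail, but
    promotes it anyway once the timeout passes without a cancellation.
    """
    if cancel_requested:
        return GateDecision("cancelled", "operator cancelled the deployment")
    if probe_healthy:
        return GateDecision("promote", "health probe passed")
    if minutes_elapsed < timeout_minutes:
        return GateDecision("hold", "health probe failing; still waiting")
    # Timeout reached, probe still failing, and nobody cancelled.
    return GateDecision("promote", "timeout reached with no cancellation")


if __name__ == "__main__":
    # Failing probe, 30 minutes in, no cancellation: deployment is held.
    print(decide(False, 30, False).action)
    # Failing probe, 60 minutes in, no cancellation: deployment proceeds.
    print(decide(False, 60, False).action)
```

In this model, the only way to stop a failing deployment from eventually going live is an explicit cancellation, which in our case was never sent because we did not know the deployment was still running.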
Contributing factors:
We rely on alerts to inform us of application and feature outages outside of our core hours of 8 am to 6 pm EDT. Although we received alerts, they arrived as emails and instant messages. This, coupled with the overnight timing of the outage, delayed our response.
Future mitigations:
As a result of this issue, we are taking the following actions:
Once more, we deeply apologize for this outage and we will be taking steps to ensure it does not happen again.
Regards,
FMX Team