Application outage

Incident Report for FMX System

Postmortem

We want to thank you for using FMX to manage your operations. We understand that FMX is often a critical component of running your organization and therefore we take any service disruptions seriously. This postmortem report will help you to better understand what caused the interruption in service as well as how we plan to avoid issues like these in the future.

Root cause of outage:

On 8-26 we discovered a high-priority bug in our application and determined that it would be disruptive enough to our users that we should deploy a mid-week fix. On the evening of 8-26, we attempted to deploy this fix twice and were unsuccessful both times. We later determined that a problematic change was included in this deployment. During the second deployment attempt, a rare set of circumstances followed: Windows Update restarted the build server mid-deployment, and Azure, our hosting service, continued the deployment in the background without our knowledge.

As designed, Azure initially refused to deploy the broken app to the live servers because its health probes were failing (a protective measure to prevent deploying bad code). After an hour of health probe failures, and having received no additional cancellation attempts from us, Azure deployed the broken application, which is also part of its design.

Contributing factors:

We rely on alerts to inform us of application and feature outages outside of our core hours of 8 am to 6 pm EDT. We did receive alerts, but they arrived as emails and instant messages. This, coupled with the overnight timing of the outage, delayed our response.

Future mitigations:

As a result of this issue, we are taking the following actions:

We are setting up a tiered, automated calling service that is triggered when alerts are issued, so that team members are made aware of them.
When we cancel a deployment we will now verify that the corresponding cloud service deployment is actually cancelled, to protect against this rare set of circumstances in the future.
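
For readers who run similar pipelines, the cancellation check described above could look roughly like the following. This is a minimal sketch, not our actual tooling: `confirm_cancelled`, the `get_status` callable, and the status names are all hypothetical stand-ins for whatever the hosting provider's API really exposes.

```python
import time
from typing import Callable

# Hypothetical status names; a real hosting API will use its own vocabulary.
CANCELLED = "cancelled"
TERMINAL_STATES = {"cancelled", "succeeded", "failed"}


def confirm_cancelled(
    get_status: Callable[[], str],
    timeout_s: float = 600.0,
    poll_interval_s: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the deployment reaches a terminal state or the timeout expires.

    Returns True only if the final observed state is "cancelled".
    A timeout or any other terminal state returns False, so the caller
    treats an unconfirmed cancellation as a failure that needs attention.
    """
    waited = 0.0
    while True:
        status = get_status()  # assumed to query the cloud service directly
        if status in TERMINAL_STATES:
            return status == CANCELLED
        if waited >= timeout_s:
            return False
        sleep(poll_interval_s)
        waited += poll_interval_s
```

The point of the sketch is that the check asks the cloud service itself rather than trusting the build server, which is the component that was restarted in this incident.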

Once more, we deeply apologize for this outage, and we are taking steps to ensure it does not happen again.

Regards,

FMX Team

Posted Sep 03, 2020 - 10:08 EDT

Resolved

This incident has been resolved. We greatly apologize for the inconvenience!

Posted Aug 27, 2020 - 11:11 EDT

Monitoring

A fix has been implemented and we're monitoring the results. Service should now be returning.

Posted Aug 27, 2020 - 08:38 EDT

Identified

The issue has been identified and a fix is being implemented. We expect a return of service in 1 hour. We greatly apologize for the inconvenience.

Posted Aug 27, 2020 - 08:26 EDT

Investigating

We are currently investigating this issue. We greatly apologize for the inconvenience!

Posted Aug 27, 2020 - 00:00 EDT

This incident affected: Web App, API, Email, and Reporting Dashboards.