Degraded performance

Incident Report for FMX System

Postmortem

We want to thank you for using FMX to manage your operations. We understand that FMX is often a critical component of running your organization, and we therefore take any service disruption seriously. This postmortem report explains what caused the interruption in service and how we plan to avoid similar issues in the future.

The root cause of the outage

On the morning of February 9th, we discovered that the page response time for the FMX application was higher than normal. For reference, “page response time” refers to the length of time it takes to load an individual page in an application. After investigating, we determined that the high page response time was caused by a lack of available hardware in the Microsoft Azure datacenter where our application is hosted. Normally, FMX can tap into additional server capacity in our datacenter to handle increased traffic. On Tuesday morning, as our traffic began to increase, we were unable to autoscale to a higher number of servers because there was no additional hardware available for us to scale up to.

Solution

To solve the problem, we manually removed all of our servers and redeployed our application in a new cluster of servers. This allowed us to access additional hardware elsewhere in Azure. During this period, users were unable to access FMX at all. Once the application was fully redeployed, users could access it again, and page response time returned to its normal range.

Future mitigations

After we confirmed the root cause of the outage, we alerted Microsoft to the problem and asked for additional guidance. They advised us that although this problem is quite rare, we may run into it again. For context, we have used Azure as our hosting provider for 8 years and have encountered this issue only once.

Microsoft provided us with some advice on how to mitigate the problem without taking the application offline should we encounter it again.
Additionally, we have added alerts for when an autoscale failure occurs so that we can respond more quickly to the problem should it ever recur.
Lastly, we’re currently exploring options to keep page response times acceptable while we mitigate this problem if it recurs.
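
As an illustration only, an autoscale failure alert of the kind described above could compare how many servers the autoscaler has requested with how many are actually running, and fire when the shortfall lasts longer than a grace period. The Python sketch below is a simplified example, not our production monitoring. The names ScaleSample and autoscale_failed and the 10-minute grace period are assumptions made for this example and are not part of any Azure API.

```python
from dataclasses import dataclass


@dataclass
class ScaleSample:
    """One observation of the server pool (hypothetical structure)."""
    minute: int    # minutes since monitoring started
    desired: int   # servers the autoscaler asked for
    running: int   # servers actually serving traffic


def autoscale_failed(samples, grace_minutes=10):
    """Return True if the pool stayed short of servers for at least grace_minutes.

    The grace period is an assumed value; new servers normally need a few
    minutes to start, so a brief shortfall is not treated as a failure.
    """
    shortfall_start = None
    for sample in sorted(samples, key=lambda s: s.minute):
        if sample.running < sample.desired:
            # Remember when the current shortfall began.
            if shortfall_start is None:
                shortfall_start = sample.minute
            if sample.minute - shortfall_start >= grace_minutes:
                return True
        else:
            # The pool caught up, so any earlier shortfall is over.
            shortfall_start = None
    return False
```

In this sketch, a pool that is asked for 6 servers but stays at 4 for 10 minutes or more would raise the alert, while a pool that catches up within the grace period would not.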

Once more, we deeply apologize for this outage. We are taking steps to ensure that, in the unlikely event this problem occurs again, we limit the disruption as much as possible.

Regards,

FMX Team

Posted Feb 18, 2021 - 10:36 EST

Resolved

This issue has now been resolved. Normal site performance should be restored. We will continue to monitor the situation closely.

Posted Feb 09, 2021 - 10:21 EST

Monitoring

A fix has been implemented and we're currently monitoring the results. We will publish a postmortem when we have more details about the cause of the issue. We apologize for any inconvenience.

Posted Feb 09, 2021 - 09:48 EST

Update

We are continuing to investigate this issue. We apologize for the inconvenience.

Posted Feb 09, 2021 - 09:05 EST

Investigating

We are currently investigating an issue that is causing degraded performance. We apologize for any inconvenience.

Posted Feb 09, 2021 - 08:20 EST

This incident affected: Web App.