Thank you for using FMX to manage your operations. We understand that FMX is often a critical component of running your organization, and we therefore take any service disruption seriously. This postmortem report will help you better understand what caused the interruption in service and how we plan to avoid issues like this in the future.
The root cause of the outage
On the morning of February 9th, we discovered that the page response time for the FMX application was higher than normal. For reference, “page response time” refers to the length of time it takes to load an individual page in an application. After investigating, we determined that the high page response time was caused by a lack of available hardware in the Microsoft Azure datacenter where our application is hosted. Under normal conditions, FMX can tap into additional server capacity in our datacenter to handle increased traffic. On Tuesday morning, as our traffic began to increase, we were unable to autoscale to a higher number of servers because no additional hardware was available for us to scale up to.
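For readers who want a more concrete picture of this failure mode, here is a minimal sketch in Python. It is not FMX's actual scaling logic or any Azure API: the function name `plan_scale_out`, the per-server capacity, and all of the numbers are hypothetical and chosen only for illustration.

```python
import math


def plan_scale_out(current_servers, requests_per_sec, capacity_per_server, spare_hardware):
    """Return (servers_after_scaling, shortfall) for a simple scale-out decision.

    All parameters are hypothetical illustration values, not FMX's real settings.
    """
    # Servers needed to carry the current load (this sketch never scales below the current count).
    needed = max(current_servers, math.ceil(requests_per_sec / capacity_per_server))
    wanted_extra = needed - current_servers
    # The datacenter can only hand out hardware that actually exists.
    granted_extra = min(wanted_extra, spare_hardware)
    return current_servers + granted_extra, wanted_extra - granted_extra


# Spare hardware available: the scale-out request is fully satisfied.
print(plan_scale_out(4, 900, 150, spare_hardware=10))

# No spare hardware: the server count stays flat and a shortfall remains,
# so each existing server carries more load and pages respond more slowly.
print(plan_scale_out(4, 900, 150, spare_hardware=0))
```

In the second call the request for more servers cannot be met, which mirrors the situation described above: traffic rose, but the number of servers could not rise with it.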
To solve the problem, we manually removed all of our servers and redeployed our application in a new cluster of servers. This allowed us to access additional hardware elsewhere in Azure. During the redeployment, users were unable to access FMX at all. Once the application was fully redeployed, users were able to access it again, and its page response time returned to its normal range.
After we confirmed the root cause of the outage, we alerted Microsoft to the problem and asked for additional guidance. They told us that although this problem is quite rare, it is possible that we may run into it again. For context, we have used Azure as our hosting provider for 8 years and have encountered this issue only once.
Once more, we deeply apologize for this outage. We will be taking steps to ensure that, in the unlikely event we encounter this problem again, we limit the disruption as much as possible.