Introduction
On Saturday, 08.02.2020, Airtame Cloud suffered a service disruption from approximately 03:10 to 16:50 UTC, during which most users were unable to use Airtame Cloud.
We apologise for the disruption. With this postmortem we would like to explain how the incident was handled and what we will do to minimise the risk of similar disruptions in the future.
Timeline
- 03:10 - We receive alerts of high CPU usage on our database instance.
- 12:14 - Engineering starts investigating the issue.
- 13:46 - A potential cause is identified in a running performance test, and the test is stopped. Service is briefly restored. In hindsight, this recovery was not due to stopping the performance test, but because we had stopped backend services to prepare a failover of our database; this was not clear at the time, as metrics shipping turned out to be delayed.
- 14:03 - The issue reoccurs even though the performance test remains stopped. Investigation of the root cause continues.
- 16:17 - The root cause is identified: Airtame devices running firmware 3.8.0-b3 and above. The issue lies in new logic in the Cloud component on the device that interacts with our backend in the Cloud.
- 16:20 - A hotfix is deployed on the backend to prevent devices running the affected firmware versions from connecting to our Cloud (a sketch of such a version gate follows the timeline).
- 16:47 - The issue is mitigated and service is restored for devices running firmware 3.8.0-b2 and below.
On Monday, 10.02.2020, a backend fix is developed to also allow firmware versions 3.8.0-b3 and above to connect to our Cloud again. This fix is deployed by 15:00 UTC.
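For illustration, here is a minimal sketch of what such a backend version gate can look like, assuming devices report their firmware version when they connect. The names (FIRST_AFFECTED, may_connect) and the use of the third-party Python packaging library are illustrative choices, not our actual backend code.

```python
# Minimal, hypothetical sketch of a firmware version gate like the 16:20 hotfix.
# Assumes the connecting device reports its firmware version as a string.
from packaging.version import Version  # third-party "packaging" library

# First firmware version whose Cloud component triggers the problematic behaviour.
FIRST_AFFECTED = Version("3.8.0-b3")

def may_connect(firmware_version: str) -> bool:
    """Allow only devices below the first affected firmware version to connect."""
    return Version(firmware_version) < FIRST_AFFECTED

# Devices on 3.8.0-b2 and below keep working; 3.8.0-b3 and above are turned away.
assert may_connect("3.8.0-b2")
assert not may_connect("3.8.0-b3")
assert not may_connect("3.9.0")
```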
Explanation
The affected firmware versions introduced a device UUID handler in their Cloud component. Each time a device connected to our backend, this handler caused a full table scan of our database, leading to high CPU usage on the database.
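To make the mechanism concrete, the sketch below uses SQLite's query planner on a made-up devices table; it is purely illustrative and does not reflect our actual schema, database, or queries. It shows how a lookup on an unindexed column turns into a full table scan, while an index makes the same lookup a cheap point read.

```python
# Illustrative only: a hypothetical table and query, with SQLite standing in for
# the production database, to contrast a full table scan with an indexed lookup.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE devices (id INTEGER PRIMARY KEY, uuid TEXT)")

# Without an index on `uuid`, every lookup by UUID has to scan the whole table.
print(conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM devices WHERE uuid = ?", ("abc-123",)
).fetchall())  # plan detail reads something like "SCAN devices"

# With an index, the same lookup becomes a single index search.
conn.execute("CREATE UNIQUE INDEX idx_devices_uuid ON devices (uuid)")
print(conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM devices WHERE uuid = ?", ("abc-123",)
).fetchall())  # plan detail reads something like "SEARCH devices USING ... INDEX"
```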
On Friday, we saw a 25% increase in users with devices running firmware versions 3.8.0-b3 and above. While the absolute number of added devices was small (~200), it was enough to cause a cascading failure through the following combination of circumstances:
- The affected devices would connect to our backend.
- Each connected device would cause the backend to do a full table scan, causing high CPU usage on our database.
- This would increase query latencies, which in turn would cause WebSocket disconnections.
- The devices would try to reconnect their WebSockets, leading to even higher database load and thus even higher latencies. The connections piling up then led to memory pressure on our backends.
- Finally, our backends ran out of memory, causing all devices to disconnect from the Cloud entirely. After a random timeout, they would attempt to reconnect, so the database stayed continuously locked up and our backends were unable to recover (see the reconnection sketch after this list).
- Once the affected versions were blocked from connecting to the Cloud, our database and backends were able to recover and service was restored.
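The reconnection behaviour roughly corresponds to the sketch below; the connect call and the delay range are illustrative, not our actual firmware code. The key point is that the retry delay is random but never grows, so a fleet of disconnected devices keeps generating roughly constant load and never gives the database room to recover.

```python
# Hypothetical sketch of the device reconnection behaviour described above.
import random
import time

def reconnect_loop(connect) -> None:
    """Retry the WebSocket connection after a random, non-increasing timeout."""
    while True:
        try:
            connect()  # attempt to open the WebSocket to the Cloud backend
            return     # connected: normal device operation resumes
        except ConnectionError:
            # The delay is random but bounded and does not back off, so across
            # thousands of devices the load on the backend never tapers off.
            time.sleep(random.uniform(1, 10))
```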
Learnings
We have recently added performance tests and will continue to extend them with further checks. Even though our monitoring system detected the increase in database CPU usage, it did not report the increase in error rates on the WebSocket endpoints. Since the incident, we have implemented new checks that monitor error levels on our public load balancers; these checks are currently being validated.
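In essence, such a check compares the load balancer's error rate over a short window against a threshold. The sketch below is a simplified illustration; the metric source, window, and 5% threshold are made up and not our production configuration.

```python
# Simplified, hypothetical sketch of an error-level check for a public load balancer.
def error_rate(total_requests: int, errors_5xx: int) -> float:
    """Fraction of requests in the monitored window that returned a 5xx response."""
    return errors_5xx / total_requests if total_requests else 0.0

def should_alert(total_requests: int, errors_5xx: int, threshold: float = 0.05) -> bool:
    """Alert when the error rate exceeds the (illustrative) 5% threshold."""
    return error_rate(total_requests, errors_5xx) > threshold

# Example: 900 errors out of 12,000 requests in a minute is a 7.5% error rate.
print(should_alert(12_000, 900))  # prints True
```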