Service disruption

Incident Report for Airtame

Postmortem

Introduction

On Saturday, 08.02.2020, Airtame Cloud suffered a service disruption from approximately 03:10 to 16:50 UTC, during which most users were unable to use Airtame Cloud.

We apologise for the disruption. In this postmortem we explain what happened, how the incident was handled, and what we will do to minimise the risk of future disruptions.

Timeline

03:10 - We receive alerts of high CPU usage on our database instance.
12:14 - Engineering starts investigating the issue.
13:46 - A performance test is identified as a potential cause and is stopped. Service is briefly restored. This was not due to stopping the performance test, but because we had stopped backend services to prepare a failover of our database. This was not clear at the time, as metrics shipment turned out to be delayed.
14:03 - The issue recurs even though the performance test remains stopped. Investigation of the root cause continues.
16:17 - The root cause is identified in Airtame device firmware 3.8.0-b3 and above. The issue lies in how the new logic of the Cloud component on the device works with the backend in the Cloud.
16:20 - A hotfix is deployed on the backend to stop firmware versions 3.8.0-b3 and above from connecting to our Cloud.
16:47 - The issue is mitigated and the service restored for devices running firmware 3.8.0-b2 and below.
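
The 16:20 hotfix amounts to a version gate on the backend. The following is a minimal sketch of such a check, assuming firmware versions follow the `major.minor.patch-bN` pattern seen in the timeline. The function names and the parsing rule are illustrative assumptions, not Airtame's actual code.

```python
import re

# Assumption: versions look like "3.8.0-b3" (a beta) or "3.8.0" (a final release).
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)(?:-b(\d+))?$")

# First affected firmware version, taken from the timeline above.
FIRST_BLOCKED = "3.8.0-b3"


def parse_version(version: str) -> tuple:
    """Turn a version string into a tuple that sorts in release order."""
    match = _VERSION_RE.match(version)
    if match is None:
        raise ValueError(f"unrecognised firmware version: {version!r}")
    major, minor, patch, beta = match.groups()
    # A final release sorts after all of its betas, so rank it as infinity.
    beta_rank = float("inf") if beta is None else int(beta)
    return (int(major), int(minor), int(patch), beta_rank)


def is_blocked(version: str) -> bool:
    """Return True if the firmware is FIRST_BLOCKED or newer (hypothetical gate)."""
    return parse_version(version) >= parse_version(FIRST_BLOCKED)
```

Under these assumptions, `is_blocked("3.8.0-b2")` is false and `is_blocked("3.8.0-b3")` is true, matching the split between restored and blocked devices described at 16:47.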


On Monday, 10.02.2020, a backend fix was developed to allow firmware versions 3.8.0-b3 and above to connect to our Cloud again. This fix was deployed by 15:00 UTC.

Explanation

In the Cloud component of the affected firmware versions, a device UUID handler was introduced. Each time a device connected to our backend, this UUID handler triggered a full table scan in our database, leading to high CPU usage on the database instance.
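
To illustrate what a full table scan means in practice, here is a small sketch using SQLite from the Python standard library as a stand-in. The report does not name the query, the schema, or how the backend fix removed the scan, so the `devices` table, the `uuid` column, and the index below are all hypothetical. An index is only one common remedy.

```python
import sqlite3

# Hypothetical schema: the real table layout is not described in the report.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE devices (id INTEGER PRIMARY KEY, uuid TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO devices (uuid, name) VALUES (?, ?)",
    [(f"uuid-{i}", f"device-{i}") for i in range(1000)],
)


def query_plan(connection: sqlite3.Connection) -> str:
    """Return SQLite's plan for looking up one device by UUID."""
    rows = connection.execute(
        "EXPLAIN QUERY PLAN SELECT id FROM devices WHERE uuid = ?", ("uuid-42",)
    ).fetchall()
    # The last column of each row holds the human-readable plan step.
    return " ".join(row[-1] for row in rows)


# Without an index on uuid, the database must read every row (a SCAN).
plan_without_index = query_plan(conn)

# With an index, the same lookup becomes a SEARCH using that index.
conn.execute("CREATE INDEX idx_devices_uuid ON devices (uuid)")
plan_with_index = query_plan(conn)

print(plan_without_index)
print(plan_with_index)
```

The cost of the unindexed lookup grows with the size of the table and is paid again on every connection, which is why a modest number of devices can saturate the database CPU.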

On Friday, we saw a 25% increase in users with devices running firmware versions 3.8.0-b3 and above. While the absolute number of added devices was small (~200), this was enough to cause a cascading failure due to a combination of circumstances:

1. The affected devices would connect to our backend.
2. Each connecting device would cause the backend to do a full table scan, causing high CPU usage on our database.
3. This would increase query latencies, which in turn would cause WebSocket disconnections.
4. The devices would try to reconnect their WebSockets, leading to an even higher database load and thus higher latencies. The connections piling up then led to memory issues on our backends.
5. Finally, our backends ran out of memory, causing all devices to disconnect from the Cloud entirely. After a random timeout, they would attempt to reconnect, so the database remained locked up and our backends were unable to recover.

Once the affected versions were blocked from connecting to the Cloud, our database and backends were able to recover and service was restored.
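
The feedback loop described above can be sketched with a deliberately crude model. It assumes that every disconnected device retries once per time step, that each attempt costs one table scan, and that the database can complete a fixed number of scans per step before latencies exceed the WebSocket timeout. The device counts follow from the figures above (~200 added devices being a 25% increase implies roughly 800 before and 1000 after). The capacity of 900 scans per step is a made-up number chosen only to show the threshold effect.

```python
def reconnect_storm(affected_devices: int, scans_per_tick: int, ticks: int = 10) -> list:
    """Return the number of connected devices after each tick of a toy model."""
    connected = 0
    history = []
    for _ in range(ticks):
        # Every device that is not connected retries, costing one scan each.
        attempts = affected_devices - connected
        if attempts <= scans_per_tick:
            # The database keeps up, so all attempts succeed.
            connected += attempts
        else:
            # The database is saturated: latencies rise and every WebSocket drops.
            connected = 0
        history.append(connected)
    return history


# Hypothetical capacity of 900 scans per tick.
print(reconnect_storm(800, 900, ticks=3))   # load before the increase
print(reconnect_storm(1000, 900, ticks=3))  # load after the increase
```

In this model the smaller fleet connects once and stays connected, while the larger fleet never gets past the first step. The real system had randomised retry timeouts and memory pressure on the backends, which this sketch leaves out.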

Learnings

We have recently added performance tests, and will continue to add further checks to these. Even though our monitoring system detected the increase in database CPU usage, it did not report the increase in error rates on the WebSocket endpoints. Since the incident, we have implemented new checks that monitor the error levels on our public load balancers. These checks are currently being validated.
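
As an illustration of what an error-level check can look like, the sketch below tracks the share of failed responses over a sliding window of recent requests. The report does not describe how the new checks are implemented, so the class name, the window size, the threshold, and the choice to count HTTP 5xx responses as errors are all assumptions. In practice such a rule would more likely be configured in the monitoring system than written by hand.

```python
from collections import deque


class ErrorRateMonitor:
    """Track the share of failed requests over the most recent `window` requests."""

    def __init__(self, window: int = 1000, threshold: float = 0.05):
        self.threshold = threshold
        # True marks an error, False a success; old entries fall off automatically.
        self._outcomes = deque(maxlen=window)

    def record(self, status_code: int) -> None:
        # Assumption: HTTP 5xx responses from the load balancer count as errors.
        self._outcomes.append(status_code >= 500)

    def error_rate(self) -> float:
        if not self._outcomes:
            return 0.0
        return sum(self._outcomes) / len(self._outcomes)

    def should_alert(self) -> bool:
        # Alert only when the rate is strictly above the threshold.
        return self.error_rate() > self.threshold


monitor = ErrorRateMonitor(window=10, threshold=0.2)
for code in [200] * 8 + [502] * 2:
    monitor.record(code)
rate_at_threshold = monitor.error_rate()
alert_at_threshold = monitor.should_alert()

monitor.record(503)  # pushes the oldest success out of the window
alert_after_spike = monitor.should_alert()
```

A check of this kind would have fired on the WebSocket failures independently of the database CPU alert.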

Posted Feb 13, 2020 - 09:06 UTC

Resolved

This incident has been resolved.

Posted Feb 08, 2020 - 16:37 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Feb 08, 2020 - 16:20 UTC

Identified

The issue has been identified and a fix is being implemented.

Posted Feb 08, 2020 - 16:17 UTC

Investigating

We continue to investigate the root cause of the issues.

Posted Feb 08, 2020 - 14:03 UTC

Identified

The issue has been identified and a fix is being implemented.

Posted Feb 08, 2020 - 13:46 UTC

Update

We are continuing to investigate this issue.

Posted Feb 08, 2020 - 13:25 UTC

Investigating

We are currently experiencing a service disruption and are investigating the issue.

Posted Feb 08, 2020 - 12:14 UTC

This incident affected: Airtame Cloud and AWS rds-eu-central-1.