Degraded performance with Device/Customer API

Incident Report for Sardine AI

Postmortem

Timeline of events

Devices API latency (P95 and P99) stayed above the expected thresholds for an extended period. Customer API calls that make internal Devices API calls timed out because of the high latency.

Root Cause

A WebSDK deployment increased the number of events sent to device-events. The sudden, sustained increase in events led to higher P90 and P99 latency on Bigtable (our primary database), which in turn led to higher API latency.

Impact on Customers:

Clients calling the Devices API experienced very high latency.
Clients calling the Customers API received no device features when latency exceeded 2s.
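
The second impact above is a timeout with graceful degradation: the response is still returned, but without device features. A minimal sketch of that pattern follows, assuming a 2-second budget taken from the figure above. The function names, the response shape, and the simulated delay are hypothetical and are not Sardine's actual implementation.

```python
import asyncio
from typing import Any, Dict

# Assumed budget, taken from the ">2s" figure in the impact list.
DEVICE_CALL_TIMEOUT_S = 2.0


async def fetch_device_features(session_key: str, delay_s: float) -> Dict[str, Any]:
    """Hypothetical stand-in for the internal Devices API call."""
    await asyncio.sleep(delay_s)  # simulates latency from the backing store
    return {"session_key": session_key, "device_risk": "low"}


async def build_customer_response(
    session_key: str,
    delay_s: float,
    timeout_s: float = DEVICE_CALL_TIMEOUT_S,
) -> Dict[str, Any]:
    """Build a response, omitting device features if the call is too slow."""
    response: Dict[str, Any] = {"session_key": session_key, "device": None}
    try:
        response["device"] = await asyncio.wait_for(
            fetch_device_features(session_key, delay_s), timeout=timeout_s
        )
    except asyncio.TimeoutError:
        # Degrade gracefully: keep "device" as None instead of failing the call.
        pass
    return response
```

With this shape, a slow internal call costs the caller at most the timeout budget, at the price of a response without device data.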

Start Time: 6:20 AM PST

End Time: 11:45 AM PST

Sep 21:

1. 8:18 AM PST: On-call was paged for high Devices API latency.
2. An initial investigation showed that Bigtable latency (P90, P99) was high for the devices tables.
3. Started a thread in Slack with DevOps to check infra status.
4. A WebSDK deployment at 6:20 AM PST had caused a high transaction count (both reads and writes) on Bigtable, which was the cause of the higher latency.
5. The feature flag to disable duplicate events was triggered to reduce the load on Bigtable.
6. 10:52 AM PST: The WebSDK was rolled back. After a delay to purge the CDN cache, Devices API and Bigtable latency returned to normal.

We observed an increased number of events:

Various Bigtable operations (read/mutate across various tables) spiked during the outage.
A hot key on the biometrics table, driven by one specific client, added latency.
The session-profiles table had a hot key as well, for the same client.
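
A hot key of the kind described above shows up as one row key taking a disproportionate share of operations. The sketch below flags such keys from a sample of accessed row keys. The sampling source, the key format, and the 20% default share are assumptions for illustration only.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def find_hot_keys(
    row_keys: Iterable[str], min_share: float = 0.2
) -> List[Tuple[str, float]]:
    """Return (row_key, share) pairs whose share of operations >= min_share.

    row_keys is a sample of row keys touched by reads/mutations,
    for example collected from request logs (hypothetical source).
    """
    counts = Counter(row_keys)
    total = sum(counts.values())
    if total == 0:
        return []
    # most_common() yields keys in descending order of frequency.
    return [
        (key, n / total)
        for key, n in counts.most_common()
        if n / total >= min_share
    ]
```

For example, a sample in which one key accounts for 8 of 10 operations would be reported with a share of 0.8.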

Key Takeaway

Tighten the Devices API latency alert's request count threshold and rolling window so that on-call and the engineering team are alerted faster.
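
To make the three knobs in this takeaway concrete (latency threshold, request count threshold, rolling window), here is a minimal sketch of a rolling-window P99 alert. The class name, the nearest-rank percentile method, and all numeric values are assumptions, not the actual alert configuration.

```python
import math
from collections import deque
from typing import Deque, Tuple


class RollingLatencyAlert:
    """Fires when P99 latency over a rolling window exceeds a threshold."""

    def __init__(self, window_s: float, p99_threshold_ms: float, min_requests: int):
        self.window_s = window_s
        self.p99_threshold_ms = p99_threshold_ms
        self.min_requests = min_requests  # avoids alerting on tiny samples
        self._samples: Deque[Tuple[float, float]] = deque()

    def record(self, ts: float, latency_ms: float) -> None:
        """Record one request; timestamps are assumed to be non-decreasing."""
        self._samples.append((ts, latency_ms))
        self._evict(ts)

    def _evict(self, now: float) -> None:
        # Drop samples that have fallen out of the rolling window.
        while self._samples and self._samples[0][0] <= now - self.window_s:
            self._samples.popleft()

    def should_alert(self, now: float) -> bool:
        self._evict(now)
        if len(self._samples) < self.min_requests:
            return False
        latencies = sorted(latency for _, latency in self._samples)
        rank = math.ceil(0.99 * len(latencies))  # nearest-rank percentile
        return latencies[rank - 1] > self.p99_threshold_ms
```

A shorter window and a lower threshold page sooner but raise the risk of noisy alerts, which is why the minimum request count is kept as a separate guard.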

Action Items

Update the Devices API latency alert, reducing the current threshold as well as the rolling window - DONE
Update Devices API dashboard to include more Bigtable performance metadata - DONE
Add tracing for the events API - DONE
Add logic to drop behavior data for a specific client - DONE
Add a kill switch for behavior data via a feature flag - TODO
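
The last two action items can be sketched together: a per-client block list plus a global kill switch, both read from feature flags. The flag names, the event shape, and the choice to strip only the behavior payload (rather than drop the whole event) are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Set


@dataclass
class BehaviorFlags:
    """Hypothetical feature-flag state for behavior data ingestion."""

    behavior_data_enabled: bool = True  # global kill switch
    blocked_clients: Set[str] = field(default_factory=set)  # per-client drop


def filter_event(event: Dict[str, Any], flags: BehaviorFlags) -> Dict[str, Any]:
    """Return the event to persist, without behavior data when it is disabled."""
    allowed = (
        flags.behavior_data_enabled
        and event.get("client_id") not in flags.blocked_clients
    )
    if allowed:
        return event
    # Strip only the behavior payload; the rest of the event is kept.
    return {k: v for k, v in event.items() if k != "behavior"}
```

Keeping the check in one function means the kill switch and the per-client rule cannot drift apart, and either can be flipped without a deployment.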

Posted Nov 07, 2023 - 18:54 UTC

Resolved

This incident has been resolved.

Posted Sep 21, 2023 - 19:07 UTC

Update

Full operation was restored, and we will publish a post-mortem as soon as possible.

Posted Sep 21, 2023 - 19:06 UTC

Investigating

Starting around 18:30 UTC, clients may be experiencing degraded Device API and/or Customer API performance. Device API may be seeing higher latency than usual. Customers API may be seeing slightly higher latency than usual, and some responses may be missing device-related data.

Posted Sep 21, 2023 - 18:42 UTC

This incident affected: Device APIs and Customer APIs.