Degraded performance with Device/Customer API
Incident Report for Sardine AI
Postmortem

Timeline of events

Devices API latency (P95, P99) stayed above the expected thresholds for roughly five and a half hours (6:20 AM to 11:45 AM PST). Customer API calls that make internal devices API calls timed out because of the high latency.

Root Cause

A WebSDK deployment caused an increase in the number of events sent to device-events. The sudden and sustained increase in events led to higher P90 and P99 latency on Bigtable (our primary database), which in turn led to higher API latency.

Impact on Customers:

  1. Clients calling the devices API experienced very high latency.
  2. Clients calling the customers API received responses without device features when latency exceeded 2s.
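
The second impact above (responses returned without device features once the device lookup is too slow) can be illustrated with a small deadline-and-fallback sketch. This is not the production code: the function names, the response shape, and the thread-pool approach are all assumptions made for illustration. Only the 2s cutoff comes from this report.

```python
import concurrent.futures

# From the report: device features are omitted when latency exceeds 2 s.
DEVICE_TIMEOUT_S = 2.0

# Hypothetical shared pool for outbound device lookups.
_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def fetch_with_deadline(fetch_device_features, session_key, timeout_s=DEVICE_TIMEOUT_S):
    """Return device features, or None if the lookup misses the deadline.

    `fetch_device_features` is a hypothetical callable standing in for the
    internal devices API call.
    """
    future = _executor.submit(fetch_device_features, session_key)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # Best effort only: a call that is already running is not interrupted.
        future.cancel()
        return None


def build_customer_response(customer_id, session_key, fetch_device_features,
                            timeout_s=DEVICE_TIMEOUT_S):
    """Build a customer response, degrading gracefully without device data."""
    response = {"customer_id": customer_id}
    features = fetch_with_deadline(fetch_device_features, session_key, timeout_s)
    if features is not None:
        response["device"] = features
    return response
```

With this shape, a slow device lookup costs the caller at most the timeout and the response simply lacks the `device` field, which matches the behavior clients saw during the incident.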

Start Time: 6:20 AM PST

End Time: 11:45 AM PST

  • Sep 21:
1. 8:18 AM PST - On-call was paged for high devices API latency.
2. An initial investigation showed that Bigtable latency (P90, P99) was high for the devices tables.
3. Started a Slack thread with DevOps to check infra status.
4. Determined that a WebSDK deployment at 6:20 AM PST had caused a high transaction count (both reads and writes) on Bigtable, which was the cause of the higher latency.
5. Triggered the feature flag that disables duplicate events to reduce the load on Bigtable.
6. 10:52 AM PST - Rolled back the WebSDK. After a delay to purge the CDN cache, devices API and Bigtable latency returned to normal.

We observed an increased number of events:

  • Various Bigtable operations (reads and mutations across multiple tables) spiked during the outage.
  • A hot key on the biometrics table, driven by one specific client, caused latency.
  • A hot key on the session-profiles table was driven by the same client.
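
The hot keys noted above can be spotted by counting how often each row key is touched per table and flagging keys that take a disproportionate share of traffic. The sketch below assumes access logs are available as `(table, row_key)` pairs. That input format, the function name, and both thresholds are hypothetical and would need tuning against real traffic.

```python
from collections import Counter


def find_hot_keys(accesses, min_share=0.2, min_count=100):
    """Flag row keys that receive an outsized share of a table's traffic.

    accesses: iterable of (table, row_key) pairs (assumed log format).
    Returns {table: [(row_key, share_of_table_traffic), ...]}.
    """
    per_table = {}
    for table, row_key in accesses:
        per_table.setdefault(table, Counter())[row_key] += 1

    hot = {}
    for table, counts in per_table.items():
        total = sum(counts.values())
        flagged = [
            (key, count / total)
            for key, count in counts.most_common()
            if count >= min_count and count / total >= min_share
        ]
        if flagged:
            hot[table] = flagged
    return hot
```

A key is flagged only when it is both frequent in absolute terms and dominant relative to its table, which avoids false positives on small tables.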

Key Takeaway

  1. Tighten the devices API latency and request count alert thresholds, and shorten the rolling window, so that on-call and the engineering team are alerted faster.
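
As an illustration of how threshold and rolling window interact, here is a minimal sketch of a rolling-window percentile alert. The class name, the nearest-rank percentile method, and all numeric values are assumptions for illustration. The actual alert lives in the monitoring stack, not in application code.

```python
import math
from collections import deque


class RollingLatencyAlert:
    """Page when the Pxx latency over a rolling time window exceeds a threshold."""

    def __init__(self, window_s, threshold_ms, percentile=99.0, min_samples=20):
        self.window_s = window_s
        self.threshold_ms = threshold_ms
        self.percentile = percentile
        self.min_samples = min_samples
        self._samples = deque()  # (timestamp_s, latency_ms), oldest first

    def record(self, timestamp_s, latency_ms):
        """Add a sample and evict everything older than the window."""
        self._samples.append((timestamp_s, latency_ms))
        cutoff = timestamp_s - self.window_s
        while self._samples and self._samples[0][0] <= cutoff:
            self._samples.popleft()

    def percentile_ms(self):
        """Nearest-rank percentile of the samples currently in the window."""
        if not self._samples:
            return None
        values = sorted(latency for _, latency in self._samples)
        rank = max(1, math.ceil(self.percentile / 100.0 * len(values)))
        return values[rank - 1]

    def should_page(self):
        """True when enough samples exist and the percentile breaches the threshold."""
        if len(self._samples) < self.min_samples:
            return False
        return self.percentile_ms() > self.threshold_ms
```

A shorter `window_s` makes the alert react sooner to a sustained regression, at the cost of more sensitivity to brief spikes. `min_samples` guards against paging on a handful of requests.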

Action Items

  1. Update Device API latency alert, reducing current threshold as well as rolling window - DONE
  2. Update Devices API dashboard to include more Bigtable performance metadata - DONE
  3. Add tracing for the events API - DONE
  4. Add logic to drop behavior data for specific client - DONE
  5. Add kill switch to behavior data via feature flag - TODO
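
Action items 4 and 5 can be pictured as a single filter applied to incoming events before they are written to storage. The sketch below is an assumption about shape only: the event fields (`client_id`, `behavior`), the `BehaviorFlags` type, and the function name are hypothetical and do not describe the actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class BehaviorFlags:
    """Hypothetical feature-flag state for behavior data."""
    kill_switch: bool = False                           # drop behavior data for everyone
    blocked_clients: set = field(default_factory=set)   # drop for specific clients only


def filter_event(event, flags):
    """Return the event, with behavior data removed if flags say so.

    The rest of the event is preserved so device data still gets recorded.
    """
    if "behavior" not in event:
        return event
    if flags.kill_switch or event.get("client_id") in flags.blocked_clients:
        return {k: v for k, v in event.items() if k != "behavior"}
    return event
```

Keeping the per-client block list and the global kill switch in the same check means either one can be toggled during an incident without a deployment, assuming the flags are read at request time.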
Posted Nov 07, 2023 - 18:54 UTC

Resolved
This incident has been resolved.
Posted Sep 21, 2023 - 19:07 UTC
Update
Full operation was restored, and we will publish a post-mortem as soon as possible.
Posted Sep 21, 2023 - 19:06 UTC
Investigating
Starting around 1830 UTC, clients may be experiencing degraded Device API and/or Customer API performance. Device API may be seeing higher latency than usual. Customer API may be seeing slightly higher latency than usual, and some responses may be missing device-related data.
Posted Sep 21, 2023 - 18:42 UTC
This incident affected: Device APIs and Customer APIs.