Devices API latency (P95 and P99) stayed above the expected thresholds for an extended period. Customer API calls that depend on internal devices API calls timed out because of the high latency.
A WebSDK deployment increased the number of events sent to device-events. The sudden, sustained increase in events led to higher P90 and P99 latency on Bigtable (our primary database), which in turn led to higher API latency.
Start Time: 6:20 AM PST
End Time: 11:45 AM PST
1. 8:18 AM PST - On-call was paged for high devices API latency
2. An initial investigation showed that Bigtable latency (P90, P99) was high for the devices tables
3. Started a thread in Slack with DevOps to check infra status
4. A WebSDK deployment at 6:20 AM PST caused a high transaction count (both reads and writes) on Bigtable, which was the cause of the higher latency
5. The feature flag to disable duplicate events was triggered to reduce the load on Bigtable
6. 10:52 AM PST - Rollback of WebSDK was performed. After a delay while the CDN cache was purged, devices API and Bigtable latency returned to normal.
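
The intervals implied by the timestamps in this report can be worked out directly. The following is a minimal sketch, assuming all times fall on the same day and in the same time zone (PST). The function and variable names are illustrative and do not come from any internal tooling.

```python
from datetime import datetime

# Clock format used in this report, e.g. "6:20 AM" (time zone omitted; all PST).
FMT = "%I:%M %p"


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes from start to end, assuming the same calendar day."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


# Timestamps taken from the report.
deployed = "6:20 AM"      # WebSDK deployment / incident start
paged = "8:18 AM"         # on-call paged
rolled_back = "10:52 AM"  # WebSDK rollback performed
resolved = "11:45 AM"     # incident end

time_to_detect = minutes_between(deployed, paged)         # 118 minutes
page_to_rollback = minutes_between(paged, rolled_back)    # 154 minutes
rollback_to_end = minutes_between(rolled_back, resolved)  # 53 minutes
total_duration = minutes_between(deployed, resolved)      # 325 minutes

print(f"Time to detect:   {time_to_detect} min")
print(f"Page to rollback: {page_to_rollback} min")
print(f"Rollback to end:  {rollback_to_end} min")
print(f"Total duration:   {total_duration} min")
```

Under these assumptions the incident lasted 5 hours 25 minutes in total, of which 1 hour 58 minutes passed before the first page.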
We observed an increased number of events.
A hotkey on the biometrics table, driven by one specific client, caused latency.
The session-profiles table also had a hotkey for the same client.
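
A hotkey here means that a small number of row keys receive a disproportionate share of reads and writes. The following is a rough sketch of how one might flag such keys from a sample of accessed row keys. The key format, the sample data, and the 50% threshold are all hypothetical. This is not the tooling used during the incident and it does not call any Bigtable API.

```python
from collections import Counter
from typing import Iterable


def hot_keys(row_keys: Iterable[str], share_threshold: float = 0.5) -> list[tuple[str, int]]:
    """Return (row_key, count) pairs whose share of all accesses meets the threshold."""
    counts = Counter(row_keys)
    total = sum(counts.values())
    if total == 0:
        return []
    return [
        (key, count)
        for key, count in counts.most_common()
        if count / total >= share_threshold
    ]


# Hypothetical sample: one client's key accounts for 8 of 10 accesses.
sample = ["client-a#session-1"] * 8 + ["client-b#session-1", "client-c#session-1"]

flagged = hot_keys(sample)
print(flagged)  # [('client-a#session-1', 8)]
```

A per-client breakdown like this makes it easier to see when one client dominates traffic to a table, as was the case for both tables in this incident.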