The latency of the /v1/customers, /v1/issuing/risks, and /v1/feedbacks APIs was elevated for the majority of customers from 21:02 to 23:51 on Oct 9 (Pacific Time).
This was caused by a particular traffic pattern (likely a fraud attack) that stressed one of our backend databases. While the overall traffic volume was the same as usual, the distribution of that traffic stressed Sardine's feature-computation backend and caused slow queries.
Timeline (all times Pacific):
7:43 PM Oct 9: Initial latency alert triggered. Since it self-resolved after a few minutes, no further investigation was done.
9:16 PM Oct 9: Another latency alert triggered; it also self-resolved after about 10 minutes. The on-call engineer assumed it was a transient spike and didn't investigate further. A couple of other alerts fired as well, but they were assumed to be noise.
11:02 PM: Sardine received error reports from a few clients.
11:32 PM: The issue was escalated by one of our Integration Managers.
11:40 PM: The on-call engineer identified that database CPU usage was extremely high.
11:48 PM: The on-call engineer scaled up the database.
11:51 PM: Incident resolved.
Remediation items:
Improve alerts and pager setup so repeated, self-resolving alerts escalate to a human (see the alert-escalation sketch below).
Build an auto-scaler for the database (see the auto-scaler sketch below).
Establish a better runbook so the on-call engineer can diagnose and act faster.
Improve the backend code so it is more robust against this type of traffic pattern.