Elevated latency for customers API

Incident Report for Sardine AI

Postmortem

Overview

The latency for /v1/customers v1/issuing/risks and /v1/feedbacks API was increased for majority of customers from 21:02 - 23:51 Oct 9 Pacific Time.

This was caused by certain traffic pattern (likely fraud attack) that stressed one of our backend DBs. While overall traffic volume was same as usual, the certain traffic pattern stressed Sardine’s feature computation backend and caused slow queries.

Impact

Customers had increased latency at these time periods
We saw increased number of database timeout, meaning some features were not correctly computed

Timeline

(all Pacific Time)

7:43 PM Oct 9: Initial latency alert triggered. Since it was self resolved after a few minutes, no further investigation was done

9:16 PM Oct 9: another latency alert triggered, it got self-resolved after 10 minutes as well. Oncall engineer assumed it was transient spike and didn’t investigate further. We had couple other alerts but those were assumed to be noisy alerts

11:02PM: Sardine received error reports from a few clients

11:32PM: Issue got escalated by one of our Integration Managers.

11:40PM: Oncall engineer identified database CPU usage is extremely high

11:48PM: Oncall engineer performed database scaling

11:51 PM: Incident resolved

What went wrong

Self-resolved pagers were ignored at night as they tend to be pretty noisy
Investigation took us lot of time
There is no auto-scaling available for this database product
We didn’t have enough safeguard about this particular traffic pattern that caused spike in the latency

Action items

Improving alerts and pager setup
- Enhance alerts around DB metrics
Build auto-scaler for database
Establish better runbook so oncall engineer can diagnose and act faster
Improve backend code so it’s more robust against this type of traffic pattern

Posted Oct 10, 2024 - 07:50 UTC

Resolved

Sardine's platform experienced elevated latency for the /customers API between 21:02 - 23:50 pacific time

Posted Oct 09, 2024 - 05:30 UTC