Elevated latency for customers API
Incident Report for Sardine AI
Postmortem

Overview

Latency for the /v1/customers, /v1/issuing/risks, and /v1/feedbacks APIs was elevated for the majority of customers from 21:02 to 23:51 Pacific Time on Oct 9.

This was caused by a specific traffic pattern (likely a fraud attack) that stressed one of our backend databases. Although overall traffic volume was the same as usual, this pattern put heavy load on Sardine’s feature computation backend and caused slow queries.

Impact

  • Customers experienced increased latency during this period
  • We saw an increased number of database timeouts, meaning some features were not computed correctly

Timeline

(all Pacific Time)

7:43 PM Oct 9: Initial latency alert triggered. Because it self-resolved after a few minutes, no further investigation was done.

9:16 PM Oct 9: Another latency alert triggered and also self-resolved, after 10 minutes. The on-call engineer assumed it was a transient spike and didn’t investigate further. A couple of other alerts fired but were assumed to be noise.

11:02 PM: Sardine received error reports from a few clients.

11:32 PM: The issue was escalated by one of our Integration Managers.

11:40 PM: The on-call engineer identified that database CPU usage was extremely high.

11:48 PM: The on-call engineer scaled up the database.

11:51 PM: Incident resolved.

What went wrong

  • Self-resolved pages were ignored at night because they tend to be noisy
  • Investigation took a long time: 38 minutes passed between the first client error reports (11:02 PM) and the identification of high database CPU usage (11:40 PM)
  • No auto-scaling is available for this database product
  • We didn’t have sufficient safeguards against the particular traffic pattern that caused the latency spike

Action items

  • Improve alerts and pager setup
    • Enhance alerts around DB metrics
  • Build an auto-scaler for the database
  • Establish a better runbook so the on-call engineer can diagnose and act faster
  • Improve backend code so it’s more robust against this type of traffic pattern
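
As an illustration of the alerting action item, the sketch below shows one way a pager setup could escalate an alert that keeps firing, even when each firing self-resolves. It is a minimal sketch and not Sardine's actual implementation: the class name FlapEscalator, the two-firing threshold, and the two-hour window are all assumptions chosen for the example.

```python
from collections import deque
from datetime import datetime, timedelta


class FlapEscalator:
    """Hypothetical helper: escalate when an alert fires repeatedly within
    a sliding time window, even if every firing self-resolved."""

    def __init__(self, max_firings=2, window=timedelta(hours=2)):
        # Both defaults are illustrative assumptions, not tuned values.
        self.max_firings = max_firings
        self.window = window
        self._firings = deque()

    def record_firing(self, fired_at):
        """Record one firing; return True if the alert should be escalated."""
        self._firings.append(fired_at)
        # Drop firings that have fallen outside the sliding window.
        while self._firings and fired_at - self._firings[0] > self.window:
            self._firings.popleft()
        return len(self._firings) >= self.max_firings


escalator = FlapEscalator()
first = escalator.record_firing(datetime(2024, 10, 9, 19, 43))   # 7:43 PM alert
second = escalator.record_firing(datetime(2024, 10, 9, 21, 16))  # 9:16 PM alert
```

With these assumed settings, the 7:43 PM firing alone would not escalate, while the 9:16 PM firing, arriving 93 minutes later, would.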

Posted Oct 10, 2024 - 07:50 UTC

Resolved
Sardine's platform experienced elevated latency for the /customers API between 21:02 and 23:50 Pacific Time.
Posted Oct 09, 2024 - 05:30 UTC