Elevated latency for the Customer API and Issuing Risk API

Incident Report for Sardine AI

Postmortem

Overview

Repeated timeouts and increased query latency on a few of our read replica clusters resulted in client-visible API latency.

What happened

Recent code changes, combined with shifts in traffic patterns, led to slow database queries. The slower queries increased API latency, which in turn triggered retries from some of our clients. Because our pods auto-scale on traffic volume, the retry traffic spawned additional pods, and the resulting spike in database connections degraded performance further.
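To illustrate the connection side of this feedback loop, the sketch below shows how a hard per-pod connection cap keeps the total connection count bounded even as pods scale out. This is an illustrative Go example using the standard database/sql package, not our actual service code; the driver, pool sizes, and connection lifetime are placeholder values.

```go
package dbpool

import (
	"database/sql"
	"time"

	_ "github.com/lib/pq" // Postgres driver, assumed here for illustration
)

// newReplicaPool opens a pool against a read replica with hard caps, so that
// pod auto-scaling multiplies connections by a small, known factor instead of
// by the (unbounded) number of in-flight requests and retries.
func newReplicaPool(dsn string) (*sql.DB, error) {
	db, err := sql.Open("postgres", dsn)
	if err != nil {
		return nil, err
	}
	// Placeholder limits: at most 10 open connections per pod, at most 5
	// kept idle, and every connection recycled after 5 minutes.
	db.SetMaxOpenConns(10)
	db.SetMaxIdleConns(5)
	db.SetConnMaxLifetime(5 * time.Minute)
	return db, db.Ping()
}
```

With a cap like this, scaling from N to 2N pods at most doubles the database connections, rather than letting retry storms open an arbitrary number of new ones.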

Impact

Our API latency was severely degraded during the following windows (all times UTC):

  • March 20, 20:06-20:41
  • March 22, 01:00-01:23
  • March 22, 02:14-02:34
  • March 25, 18:26-19:01

What went wrong

  • Internal communication delays meant it took us too long to notify our clients about the issue
  • Identifying the root cause took longer than it should have

Action items

Each item is listed with its target completion date:

  • Scale up database resources (done)
  • Update database connection limits and other configurations (March 31)
  • Provision a separate database resource for one of our services (April 1)
  • Tighten up internal timeout configuration (April 1; see the sketch below)
  • Optimize known slow query 2 (April 1)
  • Optimize known slow query 1 (done)
  • Optimize feature computation backend (ongoing project, targeting end of Q2)
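As a sketch of what tightening the internal timeout configuration looks like in practice, the Go example below puts a hard per-query deadline on reads against a replica, so a slow query fails fast instead of holding a connection and pushing the client into retries. The table name, column, and 250 ms budget are illustrative placeholders, not our production values.

```go
package dbquery

import (
	"context"
	"database/sql"
	"time"
)

// lookupRiskScore reads a single row from a replica with a hard deadline.
// If the replica is slow, the query is cancelled after the budget expires
// and the connection is returned to the pool promptly.
func lookupRiskScore(ctx context.Context, db *sql.DB, customerID string) (float64, error) {
	// Placeholder budget: fail fast after 250ms rather than letting a slow
	// replica hold the connection for the lifetime of the request.
	ctx, cancel := context.WithTimeout(ctx, 250*time.Millisecond)
	defer cancel()

	var score float64
	err := db.QueryRowContext(ctx,
		"SELECT risk_score FROM customer_risk WHERE customer_id = $1",
		customerID,
	).Scan(&score)
	return score, err
}
```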
Posted Mar 26, 2025 - 23:59 UTC

Resolved

This incident has been resolved.
Posted Mar 25, 2025 - 19:01 UTC

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Mar 25, 2025 - 18:49 UTC

Investigating

We are currently investigating this issue.
Posted Mar 25, 2025 - 18:26 UTC
This incident affected: Customer APIs.