Elevated latency for customers API and issuing risk API

Incident Report for Sardine AI

Postmortem

Overview

Repeated timeouts and increased query latency on a few of our read replica clusters resulted in elevated, client-visible API latency.

What happened

Recent code changes and shifts in traffic patterns led to slow database queries. The slow queries increased API latency, which caused some of our clients to retry requests. Because we auto-scaled pods based on traffic volume, the additional traffic triggered scale-ups that caused spikes in database connections, which degraded performance further.
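The connection spike described above can be illustrated with simple arithmetic: if each pod keeps its own connection pool, total database connections grow linearly with pod count. The following is a minimal sketch only. The function names, the pool size, and the connection limit are hypothetical values chosen for illustration and do not reflect our actual configuration.

```python
def total_connections(pods: int, pool_size_per_pod: int) -> int:
    """Worst-case connections if every pod fills its pool."""
    return pods * pool_size_per_pod


def max_pods_within_limit(
    max_connections: int, pool_size_per_pod: int, reserved: int = 0
) -> int:
    """Largest pod count whose pools fit under the database limit.

    `reserved` is headroom kept for admin or replication connections
    (hypothetical parameter).
    """
    usable = max_connections - reserved
    if usable <= 0 or pool_size_per_pod <= 0:
        return 0
    return usable // pool_size_per_pod


if __name__ == "__main__":
    # Hypothetical numbers: 40 pods after a scale-up, 20 connections each,
    # against a database limit of 500 with 20 connections reserved.
    print(total_connections(40, 20))
    print(max_pods_within_limit(500, 20, reserved=20))
```

Under these assumed numbers, a scale-up to 40 pods would request more connections than the database allows, which is one reason an autoscaler ceiling and a per-pod pool size are usually chosen together.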

Impact

Our API latency degraded severely during the following periods:

March 20 20:06-20:41

March 22 1:00-1:23

March 22 2:14-2:34

March 25 18:26-19:01

What went wrong

Internal communication took a while, which delayed our communication of the issue to our clients
Identifying the root cause took us a while

Action items

Action item	Target
Scale up database resources	DONE
Update database connection limit and other configurations	March 31
Provision a separate DB resource for one of our services	April 1
Tighten up internal timeout config	April 1
Optimize known slow query 2	April 1
Optimize known slow query 1	DONE
Optimize feature computation backend	ongoing project, end of Q2
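
As a general illustration of the "tighten up internal timeout config" item, one common pattern is to cap each internal call (for example a database query) by the time remaining in the overall request deadline, so that slow queries fail fast instead of holding connections after the caller has given up. This is a sketch under assumed values, not our actual configuration; the function names and the numbers are hypothetical.

```python
def remaining_budget(deadline_s: float, elapsed_s: float) -> float:
    """Seconds left before the overall request deadline (never negative)."""
    return max(0.0, deadline_s - elapsed_s)


def query_timeout(
    deadline_s: float, elapsed_s: float, max_query_timeout_s: float
) -> float:
    """Timeout to apply to the next query.

    The query never gets more than its own configured ceiling, and never
    more than what is left of the request deadline.
    """
    return min(max_query_timeout_s, remaining_budget(deadline_s, elapsed_s))


if __name__ == "__main__":
    # Hypothetical: 1.0 s request deadline, 0.5 s per-query ceiling.
    for elapsed in (0.3, 0.8, 1.2):
        print(elapsed, query_timeout(1.0, elapsed, 0.5))
```

A returned timeout of zero means the request has already exceeded its deadline, so the caller can skip the query entirely rather than add load to the database.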

Posted Mar 26, 2025 - 23:59 UTC

Resolved

This incident has been resolved.

Posted Mar 25, 2025 - 19:01 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Mar 25, 2025 - 18:49 UTC

Investigating

We are currently investigating this issue.

Posted Mar 25, 2025 - 18:26 UTC

This incident affected: Customer APIs.