Overview
On Jan 31, Sardine experienced an increase in latency due to increased database usage. The /customers, /issuing/risks, and /devices APIs experienced intermittent elevated latency from 8:03 to 13:16 Pacific time.
What happened
Sardine experienced an increase in traffic due to overall organic traffic growth. Additionally, some of our clients sent us batch-based traffic that was bursty in nature, and some of our databases didn't scale well to handle the bursty traffic.
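For illustration only, here is a minimal sketch, assuming a Go batch job, of how bursty batch submissions can be smoothed on the sender side by pacing requests at a steady rate instead of sending an entire batch at once. The sendRecord function and the numbers are hypothetical, not Sardine's or any client's actual code.

```go
package main

import (
	"log"
	"time"
)

func sendRecord(id int) error {
	// Placeholder for the real API call, e.g. a POST to /customers.
	return nil
}

func main() {
	records := make([]int, 1000) // stand-in for a 1,000-record batch

	// One request every 40ms, roughly 25 requests/second, instead of
	// firing the whole batch back-to-back.
	tick := time.NewTicker(40 * time.Millisecond)
	defer tick.Stop()

	for i := range records {
		<-tick.C // wait for the next slot before sending
		if err := sendRecord(i); err != nil {
			log.Printf("record %d failed: %v", i, err)
		}
	}
}
```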
Impact
The /customers, /issuing/risks, and /devices APIs experienced intermittent elevated latency from 8:03 to 13:16 Pacific time.
Timeline (all Pacific time)
- 8:03: Incident starts
- 8:04: Oncall engineer paged due to the increased latency
- 8:21: Alert auto-resolved; the oncall engineer concluded it was a one-off latency spike
- Around 9:00: A client reached out to Sardine with latency concerns
- 9:00 - 13:00: Multiple alerts were triggered; the oncall engineer was investigating
- 13:33: Root cause identified
- 13:40: Scaling and config changes were performed (incident ends)
What went wrong
- While an internal alert notified the oncall engineer, it wasn't escalated soon enough
- Inefficient internal logic caused a performance bottleneck
- Rate limiting didn't provide sufficient control over the bursty per-client traffic (see the sketch after this list)
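As a sketch of what tighter control could look like, the example below shows per-client token-bucket limiting, assuming a Go service and the golang.org/x/time/rate package. The X-Client-Id header, the limits, and the middleware wiring are illustrative assumptions, not Sardine's actual rate-limit implementation.

```go
package main

import (
	"net/http"
	"sync"

	"golang.org/x/time/rate"
)

// clientLimiters keeps one token bucket per API client so a single client's
// burst cannot consume shared database capacity.
type clientLimiters struct {
	mu       sync.Mutex
	limiters map[string]*rate.Limiter
}

func (c *clientLimiters) get(clientID string) *rate.Limiter {
	c.mu.Lock()
	defer c.mu.Unlock()
	l, ok := c.limiters[clientID]
	if !ok {
		// Example numbers only: sustained 50 req/s with a burst allowance of 20.
		l = rate.NewLimiter(rate.Limit(50), 20)
		c.limiters[clientID] = l
	}
	return l
}

// middleware rejects requests that exceed the caller's bucket with HTTP 429.
func (c *clientLimiters) middleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		clientID := r.Header.Get("X-Client-Id") // hypothetical client identifier
		if !c.get(clientID).Allow() {
			http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	limits := &clientLimiters{limiters: map[string]*rate.Limiter{}}
	mux := http.NewServeMux()
	mux.HandleFunc("/devices", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	http.ListenAndServe(":8080", limits.middleware(mux))
}
```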
Action items
- Enhance the internal process around alerts and escalation
- Fix the inefficient old logic to avoid similar issues in the future