System Instability across backend APIs and Dashboard

Incident Report for Sardine AI

Postmortem

Post-Incident Report: Service Degradation (November 28 & 29, 2025)

(All timestamps are in UTC)

Summary

  • First Incident (Nov 28): Elevated error rates and intermittent access were observed for approximately 5 minutes (06:43–06:48 UTC).
  • Second Incident (Nov 29): Elevated error rates were observed over a 70-minute window (18:10–19:20 UTC), including a period of severe service degradation lasting roughly 22 minutes.

Root Cause

This was caused by an unexpected and significant spike in traffic volume. The surge in requests temporarily exceeded our forecasted capacity and our auto scaling capability, causing congestion in our application layer.

Impact

Symptoms

During these windows, customers using the Dashboard and Core Risk APIs experienced increased latency and 502/503/504 errors.

Resolution and Next Steps

Short Term Solution

Our engineering teams intervened to stabilize the platform during the events.

Long Term Solution

To ensure our systems remain resilient against future traffic spikes of this magnitude, we are currently provisioning additional infrastructure and permanently increasing our system capacity.

Posted Dec 04, 2025 - 17:21 UTC

Resolved

The issue has been fully resolved, and all systems are operating normally.
Posted Nov 29, 2025 - 19:20 UTC

Monitoring

Service has been restored as of 18:45 UTC.
We are closely monitoring system performance to ensure stability.
Posted Nov 29, 2025 - 18:56 UTC

Investigating

We are currently investigating an outage affecting system availability.
Further updates will be provided as they become available.
Posted Nov 29, 2025 - 18:43 UTC

This incident affected: Device APIs, Issuing API, Customer APIs, Dashboard, and Crypto APIs.