Customer's API outage
Incident Report for Sardine AI
Postmortem

Summary

Due to a misconfiguration of an internal system, one of our internal API requests contained a very large payload. To guard against a potential DDoS attack, the receiving service stopped accepting these requests and began returning status code 413 (Payload Too Large). This, in turn, caused our customers’ API calls to fail with status code 500 errors.
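The failure chain described above can be sketched as follows. This is only an illustration, not our actual implementation: the function names, the size limit, and the error mapping are all hypothetical, chosen to show how a 413 from an internal dependency can surface to the caller as a 500.

```python
from http import HTTPStatus

# Hypothetical limit; the real threshold is not stated in this report.
MAX_PAYLOAD_BYTES = 1_000_000


def internal_service(payload: bytes) -> int:
    """Hypothetical internal service that rejects oversized request bodies."""
    if len(payload) > MAX_PAYLOAD_BYTES:
        return int(HTTPStatus.REQUEST_ENTITY_TOO_LARGE)  # 413
    return int(HTTPStatus.OK)


def customer_endpoint(internal_payload: bytes) -> int:
    """Hypothetical public handler that depends on the internal call."""
    status = internal_service(internal_payload)
    if status != HTTPStatus.OK:
        # The customer's own request was valid, so the internal failure
        # is reported as a server-side error rather than a 4xx.
        return int(HTTPStatus.INTERNAL_SERVER_ERROR)  # 500
    return int(HTTPStatus.OK)
```

In this sketch the customer never sees the 413 itself; the public endpoint only reports that something failed on the server side.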

The issue was mitigated within a few minutes for most clients.

What we are doing about this:

  • Improve alerts
    • More sensible alerts around 5xx responses
  • Improve the process for escalation and communication
  • Improve the process for pre-prod testing and rollout to prevent issues related to configuration updates
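As one illustration of what an alert around 5xx responses could look like, here is a minimal sketch of a sliding-window error-rate check. The class name, window size, threshold, and minimum sample count are assumptions made for the example and do not describe our monitoring setup.

```python
from collections import deque


class ErrorRateAlert:
    """Hypothetical alert: fires when the share of 5xx responses among
    the most recent requests exceeds a threshold."""

    def __init__(self, window_size: int = 100, threshold: float = 0.05,
                 min_samples: int = 20):
        self.statuses = deque(maxlen=window_size)  # keeps only recent responses
        self.threshold = threshold
        self.min_samples = min_samples

    def record(self, status_code: int) -> bool:
        """Record one response; return True if the alert should fire."""
        self.statuses.append(status_code)
        if len(self.statuses) < self.min_samples:
            return False  # too little data to judge the error rate
        errors = sum(1 for s in self.statuses if 500 <= s <= 599)
        return errors / len(self.statuses) > self.threshold
```

The minimum sample count keeps a single early failure from paging anyone, while the bounded window lets a short burst of 500s, like the one in this incident, push the rate over the threshold quickly.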

Posted Jul 19, 2024 - 16:56 UTC

Resolved
/v1/customer API calls made between 06:55 AM PST and 07:00 AM PST failed.
Posted Jul 18, 2024 - 12:00 UTC