Intermittent 503 errors due to API instability

Incident Report for Sardine AI

Postmortem

Introduction

  • Purpose: This report provides an overview of the recent service disruption that affected a few of our API endpoints.
  • Apology: We sincerely apologize for any inconvenience this disruption may have caused. We remain dedicated to maintaining high service availability and reliability.

Incident Overview

  • Duration: About 1 minute around 17:36 UTC and about 1 minute around 17:56 UTC.
  • Region Affected: US
  • Services Affected: Customers, Feedbacks and Issuing Risk API endpoints

Root Cause Analysis

  • Primary Issue:

    An overlooked edge case in the handler of a new API that was being rolled out internally.

  • Detailed Explanation:

    The issue began occurring intermittently after we enabled a feature flag that replaces one API used internally by our dashboard. The handler for the new API did not account for an edge case that occurred infrequently but caused the pods running our application to crash. Although the feature flag was exposed to very few clients, the pod crashes were enough to disrupt unrelated services running on the same pods.
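
The failure mode described above, where an unhandled edge case in a flag-gated handler takes down more than its own request, can be illustrated with a minimal sketch. This is not Sardine's actual code: the report does not state the language, the framework, or how the crash occurred. The sketch assumes the edge case surfaced as an unhandled exception, and every name in it (`guarded`, `new_handler`, `legacy_handler`, the `beta` flag, the `items` field) is hypothetical.

```python
from typing import Any, Callable, Dict

Request = Dict[str, Any]
Response = Dict[str, Any]
Handler = Callable[[Request], Response]


def guarded(new_handler: Handler, legacy_handler: Handler,
            flag_enabled: Callable[[Request], bool]) -> Handler:
    """Route to new_handler when the flag is on, containing its failures."""

    def handle(request: Request) -> Response:
        if not flag_enabled(request):
            return legacy_handler(request)
        try:
            return new_handler(request)
        except Exception as exc:
            # Contain the failure to this request and fall back to the
            # legacy path instead of letting the error propagate upward.
            response = legacy_handler(request)
            response["fallback_reason"] = type(exc).__name__
            return response

    return handle


# Hypothetical handlers, for illustration only.
def legacy_handler(request: Request) -> Response:
    return {"status": 200, "source": "legacy"}


def new_handler(request: Request) -> Response:
    # The assumed edge case: a request without "items" raises KeyError.
    items = request["items"]
    return {"status": 200, "source": "new", "count": len(items)}


# Hypothetical flag check: the new path is enabled only for "beta" requests.
handler = guarded(new_handler, legacy_handler, lambda req: bool(req.get("beta")))
```

With this wrapper, a flagged request that hits the edge case is served by the legacy path and tagged with the reason, while unflagged requests never touch the new handler. A guard like this only helps when the failure is a catchable exception; a crash caused by memory exhaustion or a native fault would need isolation at the process or pod level instead.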

Impact

  • An elevated number of 503 responses from endpoints such as /v1/customers, /v1/feedbacks, and /v1/issuing/risk.

Detection and Recovery Time

  • Why this wasn’t noticed earlier:

    The edge case in question was not covered by our tests. Once the issue was identified in production, we immediately diagnosed and fixed it.

Corrective Actions and Improvements

  • Immediate Response:

    The feature was disabled in production.

  • Preventive Measures:

    We’ve fixed the handler of the new API and expanded our test coverage to ensure this cannot happen again.

Conclusion

  • Commitment:

    Sardine remains firmly committed to delivering reliable and resilient services to our partners. We deeply regret the inconvenience caused by this incident and appreciate your patience and understanding.

  • Appreciation:

    Thank you for your continued trust and partnership. We value your support as we strengthen our systems and processes to ensure greater reliability and stability.

Posted Sep 10, 2025 - 12:36 UTC

Resolved

This incident has been resolved.
Posted Sep 09, 2025 - 20:22 UTC

Identified

The API is intermittently unstable and, in some cases, is returning 503 errors. Endpoints that may be affected include /v1/customers, /v1/feedbacks, and /v1/issuing/risk.

Our team is working on a fix for this issue.
Posted Sep 09, 2025 - 18:17 UTC
This incident affected: Issuing API and Customer APIs.