Increased latency in PROD US for /v1/customers and /v1/issuing/risks endpoints

Incident Report for Sardine AI

Postmortem

Summary

On May 6, 2026 from approximately 16:48 to 17:57 UTC, customers using Sardine's /v1/customers and /v1/issuing/risks APIs experienced elevated latency and degraded responses. We sincerely apologize for the disruption this caused. This document summarizes what happened, why it happened, and the steps we are taking to prevent recurrence.

What Happened

During this window, requests to the affected endpoints experienced one of two behaviors:

  • Elevated latency with a SITO reason code, indicating that Sardine was unable to compute certain risk signals within the expected timeframe. Customers still received rule evaluation results, but with limited signals.
  • For approximately 14% of /v1/customers traffic, requests returned HTTP 500 errors.

The incident was resolved at approximately 17:57 UTC after our team rerouted database traffic to a healthy replica.

Why It Happened

The root cause was an infrastructure failure in our cloud provider's (Google Cloud) database service in the us-central1 region. An internal resource shortage for certain instance types caused a routine automatic update operation on our primary read replica to fail. Although the Google Cloud UI and CLI reported the instance as healthy, it was not properly handling incoming queries. Our team performed a manual failover to redirect traffic to a healthy replica, which restored service.

What We're Doing About It

We are taking the following actions to reduce the likelihood and impact of similar incidents:

  • Improved monitoring: We are adding alerts for failed database update operations and for "zombie" database states where an instance appears healthy but is not accepting queries (a probe sketch follows this list).
  • Failover runbook: We are formalizing a documented procedure for read replica failover, including how to identify a failover target, resize a replica, update service configuration, and restart affected services.
  • Graceful degradation: We are investigating how to maintain partial service when a read replica is unavailable, rather than surfacing timeouts to customers.
  • Application timeout enforcement: We are reviewing and correcting how our services enforce database query timeouts to ensure failures surface quickly rather than hanging (see the second sketch after this list).
  • Database architecture review: We have scheduled a review with our cloud provider to evaluate high-availability configuration improvements and reduce our exposure to single-replica failure modes.
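To illustrate the "zombie" detection work above, below is a minimal sketch of the kind of query-level liveness probe we are evaluating: a replica is treated as healthy only if it answers a real query within a strict deadline, regardless of what the cloud console reports. It assumes a PostgreSQL-compatible replica reached via psycopg2; the DSN, timeout, and function name are illustrative, not our production code.

  import psycopg2

  PROBE_TIMEOUT_MS = 2000  # illustrative deadline

  def replica_is_serving(dsn: str) -> bool:
      # Return True only if the replica answers a trivial query in time,
      # independent of the instance status reported by the cloud provider.
      try:
          conn = psycopg2.connect(dsn, connect_timeout=2)
          try:
              with conn.cursor() as cur:
                  cur.execute("SET statement_timeout = %s" % PROBE_TIMEOUT_MS)
                  cur.execute("SELECT 1")
                  return cur.fetchone() == (1,)
          finally:
              conn.close()
      except psycopg2.Error:
          return False

A probe of this shape would alert on the "healthy but not serving" state that the provider's own status reporting missed during this incident.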
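For the application timeout enforcement item, the sketch below shows one way to bound query time at the connection level so a stalled replica produces a fast, explicit error instead of an open-ended hang. It also assumes psycopg2; the DSN, query, and two-second limits are illustrative assumptions, not our actual configuration.

  import psycopg2

  # Fail fast instead of hanging: connect_timeout bounds connection setup,
  # statement_timeout (milliseconds) bounds each individual query.
  conn = psycopg2.connect(
      "dbname=signals host=replica.internal",   # hypothetical DSN
      connect_timeout=2,
      options="-c statement_timeout=2000",
  )

  try:
      with conn.cursor() as cur:
          cur.execute("SELECT 1")  # stand-in for a real signal query
          row = cur.fetchone()
  except psycopg2.errors.QueryCanceled:
      # Raised when statement_timeout elapses; return a degraded response
      # (for example, the SITO reason code) instead of letting the request hang.
      row = None

Enforcing the limit at the database connection, in addition to any application-level deadline, is what keeps failures visible quickly when a replica stops responding.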

We are also requiring a full root cause analysis from Google Cloud within 3 business days.

We take the reliability of our platform seriously and apologize again for the impact this incident had on your operations. Please reach out to your account team or risksupport@sardine.ai if you have questions.

Posted May 06, 2026 - 23:28 UTC

Resolved

This incident has been resolved.
Posted May 06, 2026 - 18:11 UTC

Monitoring

We have mitigated the service degradation that began at approximately 16:48 UTC. Our team identified a failure in one of our database read replica instances on Google Cloud Platform and performed a manual failover to redirect traffic to a healthy replica with additional capacity. Error rates and latency have returned to normal levels as of approximately 17:57 UTC.

The incident lasted approximately 69 minutes. We are actively working with our cloud provider to determine the root cause of the replica failure and prevent recurrence.
Posted May 06, 2026 - 18:07 UTC

Identified

The issue has been identified and a fix is being implemented.
Posted May 06, 2026 - 17:40 UTC

Update

We are continuing to investigate this issue.
Posted May 06, 2026 - 17:15 UTC

Investigating

We are currently investigating this issue.
Posted May 06, 2026 - 17:01 UTC
This incident affected: Issuing API and Customer APIs.