Summary
On May 6, 2026, from approximately 16:48 to 17:57 UTC, customers using Sardine's /v1/customers and /v1/issuing/risks APIs experienced elevated latency and degraded responses. We sincerely apologize for the disruption this caused. This document summarizes what happened, why it happened, and the steps we are taking to prevent recurrence.
What Happened
During this window, requests to the affected endpoints experienced one of two behaviors:
- Responses included a SITO reason code, indicating that Sardine was unable to compute certain risk signals within the expected timeframe. Customers still received rule evaluation results, but with limited signals.
- For /v1/customers traffic, requests returned HTTP 500 errors.

The incident was resolved at approximately 17:57 UTC after our team rerouted database traffic to a healthy replica.
Why It Happened
The root cause was an infrastructure failure in our cloud provider's (Google Cloud) database service in the us-central1 region. An internal resource shortage for certain instance types caused a routine automatic update operation on our primary read replica to fail. Although the Google Cloud UI and CLI reported the instance as healthy, the database instances were not properly handling incoming queries. Our team performed a manual failover to redirect traffic to a healthy replica, which restored service.
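A failure of this kind, where the control plane reports an instance as healthy while queries are not being served, can be caught by probing the data path directly. The following is a minimal sketch of that idea, not a description of our production tooling. The names `probe_replica` and `pick_healthy` are hypothetical, and each replica is represented by a caller-supplied function that runs a trivial query (for example `SELECT 1`) against it.

```python
import concurrent.futures
from typing import Callable, Dict, List


def probe_replica(run_query: Callable[[], object], timeout_s: float) -> bool:
    """Return True only if the probe query completes within timeout_s.

    `run_query` is a hypothetical caller-supplied function that executes a
    trivial query against one replica. A replica that hangs or raises is
    treated as unhealthy, regardless of what the control plane reports.
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(run_query)
    try:
        future.result(timeout=timeout_s)
        return True
    except Exception:
        # Covers both a timeout and any error raised by the query itself.
        return False
    finally:
        # Do not block on a hung query; let the worker thread finish on its own.
        executor.shutdown(wait=False)


def pick_healthy(
    replicas: Dict[str, Callable[[], object]], timeout_s: float
) -> List[str]:
    """Return the names of replicas whose probe query succeeded in time."""
    return [
        name
        for name, run_query in replicas.items()
        if probe_replica(run_query, timeout_s)
    ]
```

A router built on a probe like this would stop sending traffic to a replica as soon as its queries stall, instead of waiting for the provider's status to change.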
What We're Doing About It
We are taking the following actions to reduce the likelihood and impact of similar incidents:
~~We are also requiring a full root cause analysis from Google Cloud within 3 business days.~~
[EDIT: We have received the RCA from Google. Here is an excerpt from Google’s RCA, lightly edited: On May 6, 2026, a database instance in the us-central1 region experienced total read unavailability following a series of scale-out operations. The incident was driven by a combination of regional resource exhaustion (stockout) and a logic error in the Managed Instance Group (MIG) downsizing algorithm. The MIG incorrectly prioritized the removal of healthy, running virtual machines (VMs) over non-functional "phantom" instances during automated reconciliation.
- N2) with sufficient regional capacity to restore service immediately.
- Algorithm update: A fix for the MIG’s downsizing algorithm has been developed and verified, and is currently rolling out. This update ensures that non-running instances are always prioritized for deletion over healthy ones when removing nodes. The global rollout of this fix is scheduled for completion by the end of May 2026, following standard safety and validation procedures.
EDIT END]
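To illustrate the ordering rule described in the RCA excerpt above, here is a minimal sketch of a downsizing selection that always removes non-running instances before healthy ones. It is an illustration of the stated rule only, not Google's implementation. The `Instance` type and `select_for_removal` function are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Instance:
    name: str
    running: bool  # False models a non-functional "phantom" instance


def select_for_removal(instances: List[Instance], count: int) -> List[Instance]:
    """Pick `count` instances to delete, choosing non-running ones first."""
    if count < 0:
        raise ValueError("count must be non-negative")
    # sorted() is stable and False sorts before True, so non-running
    # instances come first and the original order is kept within each group.
    ordered = sorted(instances, key=lambda inst: inst.running)
    return ordered[:count]


group = [
    Instance("vm-a", running=True),
    Instance("phantom-1", running=False),
    Instance("vm-b", running=True),
    Instance("phantom-2", running=False),
]
to_remove = select_for_removal(group, 2)
print([inst.name for inst in to_remove])
```

With this ordering, a healthy VM is selected only after every non-running instance in the group has already been chosen for removal.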
We take the reliability of our platform seriously and apologize again for the impact this incident had on your operations. Please reach out to your account team or risksupport@sardine.ai if you have questions.