Summary
Between 4:23 PM and 5:03 PM Pacific time on May 16th, we experienced degraded performance on the customers API and the issuing/risks API. This was caused by an outage at our cloud provider.
What Happened?
Our cloud provider (Google Cloud) experienced a network connectivity outage from 3:51 PM to 6:40 PM:
https://status.cloud.google.com/incidents/xVSEV3kVaJBmS7SZbnre
This caused one of our primary databases to stop accepting connections, resulting in an increased rate of timeouts and error responses. Our database runs in a high-availability (HA) configuration, so automatic failover should have kicked in, but it did not. The on-call engineer manually triggered a failover, and the issue was resolved.
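For illustration, the sketch below shows the shape of the failure as our services saw it: new connections to the primary either hung until a timeout or were rejected outright. The host name, credentials, and the use of PostgreSQL with the psycopg2 driver are assumptions made for the example, not details from this incident.

    # Hypothetical connectivity probe illustrating the observed failure mode.
    # Host, credentials, and the psycopg2/PostgreSQL stack are assumptions.
    import psycopg2
    from psycopg2 import OperationalError

    def probe_primary(host: str, dbname: str, user: str, password: str) -> str:
        """Attempt a single connection with a short timeout and report the result."""
        try:
            conn = psycopg2.connect(
                host=host,
                dbname=dbname,
                user=user,
                password=password,
                connect_timeout=5,  # fail fast instead of hanging on an unreachable primary
            )
            conn.close()
            return "ok"
        except OperationalError as exc:
            # Covers both "timeout expired" and "connection refused", the two
            # symptoms seen during the incident window.
            return f"unreachable: {exc}"

    if __name__ == "__main__":
        # Placeholder connection details.
        print(probe_primary("db-primary.internal", "app", "probe", "change-me"))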
Timeline
4:23 PM Incident started
4:27 PM On-call engineer was paged via PagerDuty
4:27 PM On-call engineer acknowledged the page. He initially suspected increased traffic from a customer and a newly developed feature, and tried toggling certain feature flags to reduce the load
4:57 PM Outage continued. A few more engineers joined the on-call engineer and realized the number of database connections was abnormally low
5:00 PM We decided to manually trigger a failover (a rough sketch of this step appears after the timeline)
5:03 PM Failover completed; issue resolved
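The manual failover at 5:00 was performed through our cloud provider's tooling. As a rough sketch of what that step looks like programmatically, the example below calls the Cloud SQL Admin API's instances.failover method. It assumes, purely for illustration, that the primary is a Cloud SQL instance with HA enabled; the instance name and project are placeholders, and this is not a transcript of what was actually run.

    # Hypothetical sketch of triggering a manual failover on a Cloud SQL primary
    # via the Cloud SQL Admin API. The instance name is a placeholder and Cloud SQL
    # itself is an assumption, not a detail stated in this postmortem.
    import google.auth
    from googleapiclient import discovery

    INSTANCE = "primary-db"  # placeholder instance name

    def trigger_failover() -> dict:
        credentials, project = google.auth.default()
        service = discovery.build("sqladmin", "v1beta4", credentials=credentials)

        # The failover request must carry the instance's current settings version,
        # so fetch the instance first.
        instance = service.instances().get(project=project, instance=INSTANCE).execute()
        body = {
            "failoverContext": {
                "kind": "sql#failoverContext",
                "settingsVersion": instance["settings"]["settingsVersion"],
            }
        }

        # Fails the primary over to the standby in the HA pair.
        return service.instances().failover(
            project=project, instance=INSTANCE, body=body
        ).execute()

    if __name__ == "__main__":
        operation = trigger_failover()
        print("Failover operation started:", operation.get("name"))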
What Are We Doing About This?