Summary
Between 4:23 PM and 5:03 PM Pacific time on May 16th, we experienced degraded performance on the customers API and the issuing/risks API. This was caused by an outage at our cloud provider.
What Happened?
Our cloud provider (Google Cloud) experienced a network connectivity outage from 3:51 PM to 6:40 PM:
https://status.cloud.google.com/incidents/xVSEV3kVaJBmS7SZbnre
This caused one of our primary databases to stop accepting connections, resulting in an increased rate of timeouts and error responses. Our database runs in a high-availability (HA) configuration, so automatic failover should have kicked in, but it did not. The on-call engineer manually triggered a failover, and the issue was resolved.
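For illustration, the sketch below shows the shape of the failure as our services saw it: new connections to the primary either hung until a timeout or were rejected outright. The host name, credentials, and the use of PostgreSQL with the psycopg2 driver are assumptions made for the example, not details from this incident.

    # Hypothetical connectivity probe illustrating the observed failure mode.
    # Host, credentials, and the psycopg2/PostgreSQL stack are assumptions.
    import psycopg2
    from psycopg2 import OperationalError

    def probe_primary(host: str, dbname: str, user: str, password: str) -> str:
        """Attempt a single connection with a short timeout and report the result."""
        try:
            conn = psycopg2.connect(
                host=host,
                dbname=dbname,
                user=user,
                password=password,
                connect_timeout=5,  # fail fast instead of hanging on an unreachable primary
            )
            conn.close()
            return "ok"
        except OperationalError as exc:
            # Covers both "timeout expired" and "connection refused", the two
            # symptoms seen during the incident window.
            return f"unreachable: {exc}"

    if __name__ == "__main__":
        # Placeholder connection details.
        print(probe_primary("db-primary.internal", "app", "probe", "change-me"))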
Timeline
4:23 PM Incident started
4:27 PM On-call engineer was paged via PagerDuty
4:27 PM On-call engineer acknowledged the page. He initially suspected increased traffic from a customer and a newly developed feature, and tried toggling certain feature flags to reduce the load
4:57 PM Outage continued. A few more engineers joined the on-call engineer and realized the number of database connections was abnormally low
5:00 PM We decided to manually trigger a failover (a rough sketch of this step appears after the timeline)
5:03 PM Failover completed; issue resolved
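The manual failover at 5:00 was performed through our cloud provider's tooling. As a rough sketch of what that step looks like programmatically, the example below calls the Cloud SQL Admin API's instances.failover method. It assumes, purely for illustration, that the primary is a Cloud SQL instance with HA enabled; the instance name and project are placeholders, and this is not a transcript of what was actually run.

    # Hypothetical sketch of triggering a manual failover on a Cloud SQL primary
    # via the Cloud SQL Admin API. The instance name is a placeholder and Cloud SQL
    # itself is an assumption, not a detail stated in this postmortem.
    import google.auth
    from googleapiclient import discovery

    INSTANCE = "primary-db"  # placeholder instance name

    def trigger_failover() -> dict:
        credentials, project = google.auth.default()
        service = discovery.build("sqladmin", "v1beta4", credentials=credentials)

        # The failover request must carry the instance's current settings version,
        # so fetch the instance first.
        instance = service.instances().get(project=project, instance=INSTANCE).execute()
        body = {
            "failoverContext": {
                "kind": "sql#failoverContext",
                "settingsVersion": instance["settings"]["settingsVersion"],
            }
        }

        # Fails the primary over to the standby in the HA pair.
        return service.instances().failover(
            project=project, instance=INSTANCE, body=body
        ).execute()

    if __name__ == "__main__":
        operation = trigger_failover()
        print("Failover operation started:", operation.get("name"))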
What Are We Doing About This?