Due to an error while reading a third-party configuration, a subset of APIs (Customers, Issuing, Feedback and Identity Document) was unavailable on 23rd Nov from 7:45 AM PST to 8:05 AM PST.
18th Nov 10:46 PM PST - a third-party configuration was updated, but the new JSON was malformed
22nd Nov 3:00 AM PST - one of two service instances restarted and began crash looping because it could not read the malformed configuration; the other instance remained active and handled all traffic
23rd Nov 7:45 AM PST - the second instance restarted and began crash looping for the same reason
23rd Nov 8:00 AM PST - Sardine Engineering was alerted to an outage
23rd Nov 8:05 AM PST - we corrected the malformed configuration and service was restored
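The timeline shows that the malformed JSON sat unnoticed for several days until an instance restarted and tried to read it. A minimal sketch of a syntax check that could run before a configuration update is published is shown here. The function name `validateConfig` and the sample payloads are hypothetical and are not taken from our actual tooling.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// validateConfig is a hypothetical pre-publish check: it returns an error
// when raw is not syntactically valid JSON, so a bad update can be rejected
// before any service instance tries to read it.
func validateConfig(raw []byte) error {
	if !json.Valid(raw) {
		return fmt.Errorf("configuration is not valid JSON")
	}
	return nil
}

func main() {
	// Sample payloads are illustrative only.
	good := []byte(`{"vendor": {"timeout_ms": 500}}`)
	bad := []byte(`{"vendor": {"timeout_ms": 500}`) // missing closing brace

	fmt.Println("good config error:", validateConfig(good))
	fmt.Println("bad config error:", validateConfig(bad))
}
```

A check like this only catches syntax errors. Missing or mistyped fields would still need a schema-level validation step.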
From 7:45 AM PST to 8:05 AM PST, a subset of APIs was down: the Customers, Issuing, Feedback and Identity Document APIs
During this time, we returned HTTP 502 responses for the above APIs
Device events related APIs were not impacted (these include Devices API, SDK events, etc.)
[x] Add monitoring for load balancer HTTP 5XX errors
[x] Set up paging in production
[ ] Add alerting for the “panic” keyword (during the incident, the word “panic” appeared in an INFO-level log statement and was therefore not picked up by any alerts or monitors)
[ ] Alert based on individual instance health checks
[ ] Improve robustness of loading configurations and initialization for third-party services
[ ] Determine if vendor client initializations should be non-blocking
[ ] Make configuration updates in GCP more robust
[x] Set up third-party site monitoring service
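As one possible direction for the configuration-robustness items above, the sketch below parses a third-party configuration and falls back to a last known good value instead of panicking when the JSON is malformed. The `VendorConfig` type, its fields, and the function names are hypothetical. The sketch also assumes that a last known good configuration (for example a built-in default or a cached copy) is available at startup.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// VendorConfig is a hypothetical shape for a third-party configuration.
type VendorConfig struct {
	Endpoint  string `json:"endpoint"`
	TimeoutMS int    `json:"timeout_ms"`
}

// parseVendorConfig decodes raw JSON and returns an error instead of panicking.
func parseVendorConfig(raw []byte) (VendorConfig, error) {
	var cfg VendorConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return VendorConfig{}, fmt.Errorf("parse vendor config: %w", err)
	}
	if cfg.Endpoint == "" {
		return VendorConfig{}, errors.New("parse vendor config: endpoint is empty")
	}
	return cfg, nil
}

// loadWithFallback returns the freshly parsed config when it is usable and
// the last known good config otherwise. The bool reports whether the
// fallback was used, so the caller can raise an alert.
func loadWithFallback(raw []byte, lastGood VendorConfig) (VendorConfig, bool) {
	cfg, err := parseVendorConfig(raw)
	if err != nil {
		fmt.Println("warning:", err)
		return lastGood, true
	}
	return cfg, false
}

func main() {
	// Assumption: a default or cached configuration exists at startup.
	lastGood := VendorConfig{Endpoint: "https://vendor.example", TimeoutMS: 500}
	malformed := []byte(`{"endpoint": "https://vendor.example", "timeout_ms": `)

	cfg, usedFallback := loadWithFallback(malformed, lastGood)
	fmt.Println("endpoint:", cfg.Endpoint, "used fallback:", usedFallback)
}
```

With this shape, a bad update degrades one vendor integration and produces a warning rather than taking the whole instance into a crash loop. Whether falling back is acceptable for a given vendor is a separate decision.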
Posted Nov 24, 2022 - 00:16 UTC
Due to a configuration error on our part, we were temporarily down from 7:44 AM PST to 8:05 AM PST. All services are operational again. We sincerely apologize for the outage and will provide a full postmortem as we continue to investigate.
For ~20 min between 7:44 AM PST and 8:05 AM PST:
- Customers, Issuing, Feedback and Identity Document APIs were down.
- Events and Devices APIs were not affected.