Sardine Outage
Incident Report for Sardine AI
Postmortem

Summary

Due to an error while reading a third party configuration, a subset of APIs (Customers, Issuing, Feedback and Identity Document) was unavailable from 7:45 AM PST to 8:05 AM PST on 23rd Nov.

Timeline

  1. 18th Nov 10:46 PM PST - the third party configuration was updated, but the new JSON was malformed
  2. 22nd Nov 3:00 AM PST - one of two service instances restarted and started to crash loop as it was unable to read the malformed configuration, while the other instance was still active and was handling all traffic
  3. 23rd Nov 7:45 AM PST - the second instance restarted and started to crash loop as well because of the same issue
  4. 23rd Nov 8:00 AM PST - Sardine Engineering was alerted to an outage
  5. 23rd Nov 8:05 AM PST - we updated the malformed configuration and service was restored
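
The failure in step 1 (a malformed JSON update that went unnoticed until an instance restarted) is the kind of problem a pre-publish check can catch. The following is a minimal sketch only, assuming the configuration is a single JSON object. The function name `validate_config_text` and the `required_keys` parameter are hypothetical and do not describe our actual tooling.

```python
import json


def validate_config_text(text, required_keys=()):
    """Return a list of problems found in a JSON configuration string.

    An empty list means the text parsed as a JSON object and contains
    every key in required_keys (hypothetical names chosen by the caller).
    """
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError as exc:
        # Report where parsing failed so the author can fix the update.
        return [f"malformed JSON at line {exc.lineno}, column {exc.colno}: {exc.msg}"]

    if not isinstance(parsed, dict):
        return ["top-level value must be a JSON object"]

    return [f"missing required key: {key}" for key in required_keys if key not in parsed]
```

Running a check like this before a configuration update is saved would reject the update at write time, rather than letting the error surface days later on the next restart.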

Impact

  • From 7:45 AM PST to 8:05 AM PST, a subset of APIs were down
  • Including the following APIs:

    • Customers
    • Issuing
    • Feedback
    • Identity Document
  • During this time, we returned HTTP 502 responses for the above APIs

  • Device events related APIs were not impacted (these include Devices API, SDK events, etc.)

Action Items

  • [x] Add monitoring for load balancer HTTP 5XX errors

    • [x] Set up to page in production
  • [ ] Add alerting for the “panic” keyword (during the incident, the word “panic” appeared in an INFO-level log statement and was therefore not picked up by any alerts or monitors)

  • [ ] Alert based on individual instance health check

  • [ ] Improve robustness of loading configurations and initialization for third party services

    • [ ] Determine if vendor client initializations should be non-blocking
    • [ ] Make configuration updates in GCP more robust
  • [x] Set up a third party site monitoring service
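
One possible shape for the "improve robustness of loading configurations" item is sketched below: the loader falls back to a previously loaded configuration instead of raising, and logs the failure at ERROR level so that level-based alerts can fire. This is an illustration under assumptions, not our implementation. The names `load_config_with_fallback` and `last_known_good` are hypothetical, and whether it is acceptable to keep serving with a stale or missing vendor configuration is exactly the open question in the action items above.

```python
import json
import logging

logger = logging.getLogger("config_loader")


def load_config_with_fallback(raw_text, last_known_good=None):
    """Parse raw_text as a JSON object; on failure, fall back instead of raising.

    Returns (config, healthy). healthy is False when the fallback was used,
    so the caller can report a degraded health check rather than exit.
    config is None when parsing failed and no fallback was supplied.
    """
    try:
        config = json.loads(raw_text)
        if not isinstance(config, dict):
            raise ValueError("top-level value must be a JSON object")
        return config, True
    except ValueError as exc:  # json.JSONDecodeError is a subclass of ValueError
        # ERROR level, so the failure is visible to level-based alerting.
        logger.error("third party configuration unreadable: %s", exc)
        return last_known_good, False
```

A caller could use the `healthy` flag to mark only the affected vendor integration as degraded, so a bad configuration does not put the whole instance into a crash loop.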

Posted Nov 24, 2022 - 00:16 UTC

Resolved
Due to a configuration error on our part, we were temporarily down from 7:44 AM PST to 8:05 AM PST. All services are operational again. We sincerely apologize for the outage, and we will provide a full postmortem as we continue to investigate.

For ~20 min, from 7:44 AM PST to 8:05 AM PST:
- Customers, Issuing, Feedback and Identity Document APIs were down.
- Events and Devices APIs were not affected.
Posted Nov 23, 2022 - 16:00 UTC