Degraded performance with device and behavior data ingestion

Incident Report for Sardine AI

Postmortem

Date: 2023-11-16

Introduction

Purpose: This report provides an in-depth analysis of the service degradation experienced in our events API.
Apology: We sincerely apologize for the inconvenience this incident may have caused and are committed to ensuring the highest level of service quality.

Incident Overview

Duration: Approximately 10 hours, from 2023-11-14 14:00:00 UTC to 2023-11-15 00:00:00 UTC
Services Affected: Devices API, Events API.

Root Cause Analysis

Primary Issue: The degradation was primarily due to an unexpected increase in the number of entries in our 'clients' table, which significantly slowed down some of our auth endpoints.
Secondary Factor: This coincided with an increase in traffic as we onboarded several enterprise customers, which exacerbated the issue.
Third Factor: Our monitoring for missing device data runs less frequently than it should, which delayed the notification to on-call engineers.
Technical Impact: The result was a performance bottleneck in our events API.

Impact

Service Accessibility: Slower response times and potential timeouts when accessing the events API (an internal API used by our SDK to ingest device and behavior data).
Operational Interruptions: Calls to our device endpoint had intermittent failures. Through internal investigation, we observed that 97k (~20%) of device API calls were unable to fetch the user’s device data.
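
As a rough cross-check of the figures above: if 97k failed calls correspond to about 20% of device API calls, the total number of device API calls in the incident window is on the order of 485k. Both inputs are rounded, so this is an estimate only, not a measured count.

```python
# Rounded figures quoted in this report.
failed_calls = 97_000      # "97k" device API calls that could not fetch device data
failed_percent = 20        # "~20%" of all device API calls in the incident window

# Integer arithmetic avoids floating-point noise: total = failed / (percent / 100).
estimated_total_calls = failed_calls * 100 // failed_percent
estimated_successful_calls = estimated_total_calls - failed_calls

print(estimated_total_calls)       # 485000
print(estimated_successful_calls)  # 388000
```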

Corrective Actions and Improvements

Immediate Measures: Our engineering team quickly gathered to address the issue once they were notified. Once the root cause was identified, they moved swiftly to:
- Patch suboptimal queries in our database.
- Increase the capacity of our cache to handle the higher load efficiently.
Monitoring Enhancements: We have enhanced our monitoring systems for both our cache and storage layers in the authentication service. These improvements include:
- Advanced alert mechanisms for early detection of anomalies in data patterns.
- Additional real-time monitors for our cache and storage layers.
- Regular audits and stress tests to ensure the robustness of our systems under varying loads.
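
To illustrate the kind of real-time check described above, here is a minimal sketch of a sliding-window monitor that flags when too many recent device API calls come back without device data. It is illustrative only and not the monitoring code we run: the class name `MissingDataMonitor`, the window size, and the 5% threshold are all assumptions chosen for the example.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class MissingDataMonitor:
    """Hypothetical sliding-window monitor over the most recent device API calls."""

    window_size: int = 1000   # assumed: number of recent calls to evaluate
    threshold: float = 0.05   # assumed: alert when more than 5% lack device data
    _results: deque = field(default_factory=deque, init=False, repr=False)
    _missing: int = field(default=0, init=False, repr=False)

    def record(self, has_device_data: bool) -> bool:
        """Record one call and return True if the alert condition currently holds."""
        self._results.append(has_device_data)
        if not has_device_data:
            self._missing += 1
        # Drop the oldest call once the window is over capacity.
        if len(self._results) > self.window_size:
            oldest = self._results.popleft()
            if not oldest:
                self._missing -= 1
        return self.should_alert()

    def missing_rate(self) -> float:
        """Fraction of calls in the current window that lacked device data."""
        if not self._results:
            return 0.0
        return self._missing / len(self._results)

    def should_alert(self) -> bool:
        # Require a full window so a few early failures do not trigger an alert.
        window_full = len(self._results) >= self.window_size
        return window_full and self.missing_rate() > self.threshold


if __name__ == "__main__":
    monitor = MissingDataMonitor(window_size=10, threshold=0.1)
    outcomes = [True] * 8 + [False] * 2   # 2 of the last 10 calls lack device data
    alerts = [monitor.record(ok) for ok in outcomes]
    print(alerts[-1], monitor.missing_rate())
```

Because the check is evaluated on every recorded call rather than on a schedule, the delay before an alert depends on traffic volume and window size instead of on how often a periodic job runs.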

Conclusion

Commitment: We remain dedicated to providing reliable and efficient services.
Appreciation: Your understanding and trust are greatly valued.

Posted Nov 17, 2023 - 16:03 UTC

Resolved

Between approximately 14:00 and 00:00 UTC (06:00-16:00 PT / 09:00-19:00 ET), clients may have experienced degraded device and behavior data ingestion. Some sessions created during that time are missing DI/BB data. The issue is now resolved, and we will follow up with a public-facing postmortem.

Posted Nov 14, 2023 - 14:00 UTC