Degraded performance with device and behavior data ingestion
Incident Report for Sardine AI
Postmortem

Date: 2023-11-16

Introduction

  • Purpose: This report provides an in-depth analysis of the service degradation experienced in our events API.
  • Apology: We sincerely apologize for the inconvenience this incident may have caused and are committed to ensuring the highest level of service quality.

Incident Overview

  • Duration: Approximately 10 hours, from 2023-11-14 14:00:00 UTC to 2023-11-15 00:00:00 UTC
  • Services Affected: Devices API, Events API.

Root Cause Analysis

  • Primary Issue: The degradation was primarily due to an unexpected increase in the number of entries in our 'clients' table, which significantly slowed down some of our auth endpoints.
  • Secondary Factor: This coincided with an increase in traffic as we onboarded several enterprise customers, which exacerbated the issue.
  • Third Factor: We have a monitor for missing device data, but it runs less frequently than it should, which delayed the notification of on-call engineers.
  • Technical Impact: The result was a performance bottleneck in our events API.
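
To illustrate why a growing table can slow auth lookups, and how a bounded cache (one of the corrective actions described later in this report) can absorb repeated lookups, here is a minimal Python sketch. It is not our production code: `TTLCache`, `make_cached_lookup`, and the `fetch_client` callable are hypothetical names introduced only for this example, and the capacity and TTL values are arbitrary.

```python
import time
from collections import OrderedDict


class TTLCache:
    """Small LRU cache with per-entry expiry (illustrative only)."""

    def __init__(self, capacity, ttl_seconds, clock=time.monotonic):
        self.capacity = capacity
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so the behavior can be tested
        self._entries = OrderedDict()  # key -> (expires_at, value)

    def get(self, key):
        item = self._entries.get(key)
        if item is None:
            return None
        expires_at, value = item
        if self.clock() >= expires_at:
            # Entry is stale: drop it and report a miss.
            del self._entries[key]
            return None
        # Mark as most recently used.
        self._entries.move_to_end(key)
        return value

    def put(self, key, value):
        self._entries[key] = (self.clock() + self.ttl_seconds, value)
        self._entries.move_to_end(key)
        # Evict least recently used entries once capacity is exceeded.
        while len(self._entries) > self.capacity:
            self._entries.popitem(last=False)


def make_cached_lookup(fetch_client, cache):
    """Wrap a (possibly slow) client lookup with the cache.

    `fetch_client` stands in for a query against the 'clients' table;
    it is a hypothetical callable, not a real API.
    """
    def lookup(client_id):
        cached = cache.get(client_id)
        if cached is not None:
            return cached
        record = fetch_client(client_id)
        cache.put(client_id, record)
        return record
    return lookup
```

In this sketch, a cache whose capacity is smaller than the set of active clients evicts entries before they are reused, so more requests fall through to the slow query. That is the general reason a capacity increase can help under higher load.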

Impact

  • Service Accessibility: Slower response times and potential timeouts when accessing the events API (an internal API used by our SDK to ingest device and behavior data).
  • Operational Interruptions: Calls to our device endpoint had intermittent failures. Through internal investigation, we observed that 97k (~20%) of device API calls were unable to fetch the user’s device data.

Corrective Actions and Improvements

  • Immediate Measures: Our engineering team quickly gathered to address the issue once they were notified. Once the root cause was identified, they moved swiftly to:

    • Patch suboptimal queries in our database.
    • Increase the capacity of our cache to handle the higher load efficiently.
  • Monitoring Enhancements: We have enhanced our monitoring systems for both our cache and storage layers in the authentication service. These improvements include:

    • Advanced alert mechanisms for early detection of anomalies in data patterns.
    • Additional real-time monitors for our cache and storage layers.
    • Regular audits and stress tests to ensure the robustness of our systems under varying loads.
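
As a rough illustration of the kind of check a missing-device-data monitor might perform, the Python sketch below computes the share of device API calls that returned no device data and decides whether to alert. It is an assumption-laden example, not our actual monitor: `WindowStats`, `should_alert`, `worst_case_detection_delay`, and the threshold values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Counts observed over one evaluation window (hypothetical shape)."""
    total_calls: int
    missing_device_data: int


def missing_ratio(stats):
    """Fraction of calls in the window that returned no device data."""
    if stats.total_calls == 0:
        return 0.0
    return stats.missing_device_data / stats.total_calls


def should_alert(stats, threshold=0.05, min_calls=100):
    """Alert when the missing ratio reaches the threshold.

    `threshold` and `min_calls` are illustrative defaults; `min_calls`
    avoids alerting on windows with too little traffic to be meaningful.
    """
    if stats.total_calls < min_calls:
        return False
    return missing_ratio(stats) >= threshold


def worst_case_detection_delay(check_interval_s, window_s):
    """Rough upper bound on time from failure start to first alert.

    Simplifying assumption: once the whole window lies inside the
    failure period the ratio exceeds the threshold, and the next
    scheduled check then fires at most one interval later.
    """
    return check_interval_s + window_s
```

For example, the 97k failed calls at roughly 20% reported in the Impact section imply on the order of 485k total device API calls (a back-of-envelope figure derived from those two numbers, not a measured one). `should_alert(WindowStats(485_000, 97_000))` evaluates to true under the illustrative threshold. The delay function shows why check frequency matters: under these assumptions, halving the check interval directly lowers the worst-case time before on-call engineers are notified.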

Conclusion

  • Commitment: We remain dedicated to providing reliable and efficient services.
  • Appreciation: Your understanding and trust are greatly valued.
Posted Nov 17, 2023 - 16:03 UTC

Resolved
Between approximately 14:00 and 00:00 UTC (06:00-16:00 PT / 09:00-19:00 ET), clients may have experienced degraded device and behavior data ingestion. Some sessions created during that time are missing DI/BB data. The issue is now resolved, and we will follow up with a public-facing postmortem.
Posted Nov 14, 2023 - 14:00 UTC