Repeated timeouts and increased query latency on a few of our read replica clusters resulted in client-visible API latency.
Recent code changes and shifts in traffic patterns caused slow database queries. The slow queries increased API latency, which triggered retries from some of our clients. Because we auto-scale pods based on traffic volume, the added retry traffic caused spikes in database connections, which degraded performance further.
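The feedback loop above can be illustrated with a rough back-of-the-envelope sketch. Every name and number here (`rps_per_pod`, `pool_size_per_pod`, `max_connections`, the traffic figures) is a hypothetical value chosen for illustration, not our production configuration, and the autoscaler model is deliberately simplified.

```python
import math


def pods_for_traffic(requests_per_second: float, rps_per_pod: float) -> int:
    """Pods a purely traffic-based autoscaler would target (hypothetical model)."""
    return max(1, math.ceil(requests_per_second / rps_per_pod))


def total_connections(pods: int, pool_size_per_pod: int) -> int:
    """Upper bound on database connections if every pod fills its pool."""
    return pods * pool_size_per_pod


def simulate(base_rps: float, retry_fraction: float, rps_per_pod: float,
             pool_size_per_pod: int, max_connections: int) -> dict:
    """Estimate pods and connections when a fraction of requests is retried."""
    # Retries add to the traffic the autoscaler observes.
    effective_rps = base_rps * (1 + retry_fraction)
    pods = pods_for_traffic(effective_rps, rps_per_pod)
    connections = total_connections(pods, pool_size_per_pod)
    return {
        "pods": pods,
        "connections": connections,
        "over_limit": connections > max_connections,
    }


# Illustrative numbers only: 1000 req/s, 100 req/s per pod,
# 20 connections per pod, database limit of 250 connections.
normal = simulate(1000, 0.0, 100, 20, 250)
degraded = simulate(1000, 0.5, 100, 20, 250)
```

With these assumed numbers, normal traffic stays under the connection limit, while a 50% retry rate scales the deployment far enough to exceed it. That is the pattern several of the action items below are meant to address.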
Our API latency degraded severely during the following periods:
March 20 20:06-20:41
March 22 1:00-1:23
March 22 2:14-2:34
March 25 18:26-19:01
Action Item | Target |
---|---|
Scale up database resources | DONE |
Update database connection limit and other configurations | March 31 |
Provision a separate DB resource for one of our services | April 1 |
Tighten up internal timeout config | April 1 |
Optimize known slow query 2 | April 1 |
Optimize known slow query 1 | DONE |
Optimize feature computation backend | Ongoing project, end of Q2 |