Dashboard slow or unreachable

Incident Report for Kundo

Postmortem

We apologize for any inconvenience caused by this incident. A summary of the events and measures taken follows.

Summary

A security update caused unexpected performance issues that made the Kundo Dashboard unavailable for many users.

Timeline (in CEST)

15:47 - An alarm about reduced performance in our systems is triggered.

15:50 - Our incident process is initiated and troubleshooting is started.

15:57 - Information about the incident is published on status.kundo.se.

16:11 - The root cause of the incident is identified as a change to one of our database schemas. It is an important change related to an upgrade of one of our core services, which serves Kundo's Dashboard among other features.

16:12 - Database load is reduced by scaling down consumer services. The ongoing database change is monitored and information is gathered to support informed decisions. Performance is reduced in Kundo's core services, including the Dashboard, which users experience as slow or even completely unresponsive.

16:24 - Resources are added for serving web requests to the Dashboard to reduce user impact.

16:34 - The database change is completed and the system starts to return to normal.

16:41 - All systems are back to normal load and the incident status is set to Resolved.

17:07 - All customers who have contacted Kundo by email are informed about the incident resolution.

What happened?

One of our engineering teams was preparing an upgrade of one of our web frameworks, an upgrade that is required to, among other things, maintain a high level of security. The upgrade was prepared in the days before the incident, and the planned changes had passed our internal review process. No problems were detected during deployment to our test environment that same morning, and at 15:46 the team initiated the deployment to Kundo's production environment.

The deployment included a job that changed how data was stored in one of our database systems. In the test environment this went more or less unnoticed because of the limited database size there. When the job was started in the production environment, however, the database became overloaded and the execution time for requests increased immediately.

Once the cause of the incident was understood, we decided to let the ongoing change complete (the estimated remaining time was less than 10 minutes) but to postpone the other related changes scheduled to be made by the job, since those changes were not strictly necessary for restoring system operability.

During the incident, job queues with tasks to be completed asynchronously (e.g. delivery of inbound email to be visible in the Dashboard) started to build up. After the database change completed, the systems needed about 7 minutes to return to normal operation.


We have identified several actions that will mitigate the impact of failures like this in the future, among which are:

Improve our internal process for reviewing database changes. In retrospect, this particular change could have been flagged as compute-intensive and altered to avoid overloading the system.
Initiate two projects looking into how we can better inform users about incident status and direct more users to our status page.
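
As an illustration of the first action above, one common way to make a compute-intensive data change gentler on a database is to apply it in small batches with a pause in between, rather than in one large operation. The sketch below is not the job that ran during the incident: the table `messages`, its columns `legacy_value` and `new_value`, the function name and the default batch size are all hypothetical, and SQLite is used only so that the example is self-contained. It also assumes positive integer ids.

```python
import sqlite3
import time


def backfill_in_batches(conn, batch_size=1000, pause_seconds=0.0):
    """Fill new_value from legacy_value a bounded number of rows at a time.

    Hypothetical schema: messages(id, legacy_value, new_value).
    Returns the number of batches that were applied.
    """
    last_id = 0  # assumes ids are positive integers
    batches = 0
    while True:
        # Find the next bounded range of ids after the last one processed.
        rows = conn.execute(
            "SELECT id FROM messages WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return batches
        upper_id = rows[-1][0]
        # One short transaction per batch, so locks are held only briefly.
        with conn:
            conn.execute(
                "UPDATE messages SET new_value = UPPER(legacy_value) "
                "WHERE id > ? AND id <= ?",
                (last_id, upper_id),
            )
        last_id = upper_id
        batches += 1
        # Pause so that regular traffic gets database time between batches.
        time.sleep(pause_seconds)
```

In a setup like this, the batch size and pause would be tuned against production-sized data, which is also where a review step could flag a change as compute-intensive before it is deployed.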


Further details or questions

You are most welcome to contact us via email for more information: support@kundo.se

Posted Oct 19, 2022 - 10:50 CEST

Resolved

All systems are now operational and system load has stabilized. Details on the outage will be posted here on our status page after our review process has been finalized.

Posted Oct 05, 2022 - 16:44 CEST

Monitoring

The immediate problem has now been solved and our systems are gradually returning to normal.

Posted Oct 05, 2022 - 16:36 CEST

Update

We are continuing to work on a fix for this issue.

Posted Oct 05, 2022 - 16:21 CEST

Update

Parts of our systems are now available again but with slightly reduced performance. Kundo Mail is still affected by the disturbance and we are working on this as well as scaling up all other system capacities.

Posted Oct 05, 2022 - 16:20 CEST

Identified

The problem has been traced to a newly introduced change that affects parts of our database schemas. Work has been initiated to reduce database load.

Posted Oct 05, 2022 - 16:09 CEST

Investigating

We are currently troubleshooting a problem causing the Dashboard to be slow or temporarily unreachable.

Posted Oct 05, 2022 - 15:57 CEST

This incident affected: Kundo, Dashboard, Mail, Chat, Calls, Forum, and Statistics.