We apologize for any inconvenience caused by this incident. A summary of the events and measures taken follows.
A security update caused unexpected performance issues that made the Kundo Dashboard unavailable for many users.
15:47 - An alarm about reduced performance in our systems is triggered.
15:50 - Our incident process is initiated and troubleshooting is started.
15:57 - Information about the incident is published on status.kundo.se.
16:11 - The root cause of the incident is identified as a change to one of our database schemas. The change is an important part of an upgrade of one of our core services, which serves Kundo's Dashboard among other features.
16:12 - Database load is reduced by scaling down consumer services. The ongoing database change is monitored and information is gathered to support informed decision-making. Performance is reduced in Kundo's core services, including the Dashboard, which users experience as slow or even completely unresponsive.
16:24 - Resources are added for serving web requests to the Dashboard to reduce user impact.
16:34 - The database change is completed and the system starts to return to normal.
16:41 - All systems are back to normal load and the incident status is set to Resolved.
17:07 - All customers who have contacted Kundo by email are informed that the incident is resolved.
One of our engineering teams had prepared an upgrade of one of our web frameworks, which is required to maintain a high level of security, among other things. The upgrade was prepared in the days before the incident, and the planned changes had passed our internal review process. No problems were detected during deployment to our test environment that morning, and at 15:46 the team initiated the deployment to Kundo's production environment.
The deployment included a job that changed how data was stored in one of our database systems. In the test environment this went more or less unnoticed because the database there is small. When the job was started in the production environment, however, the database was overloaded and the execution time for requests increased immediately.
After the cause of the incident was understood, it was decided to let the ongoing change complete (the estimated remaining time was less than 10 minutes) but to postpone the related changes that the job was still scheduled to make, since they were not strictly necessary for restoring system operability.
During the incident, job queues with tasks to be completed asynchronously (e.g. delivery of inbound email to be visible in the Dashboard) started to build up. After the database change completed, the systems needed about 7 minutes to return to normal operation.
You are most welcome to contact us via email for more information: support@kundo.se