We apologize for any inconvenience caused by this incident. A summary of the events and measures taken follows.
Kundo was the target of a large-scale DDoS (distributed denial of service) attack, which caused performance issues and intermittent unavailability throughout the day for many users.
Timeline (in CET)
08:14 - An alarm about reduced performance in our systems is triggered.
08:17 - Resources for serving web requests are added to reduce user impact.
08:21 - Our incident process is initiated and troubleshooting is started by the whole development team.
08:32 - The target of the attack is identified.
08:32 - The root cause of the incident is identified as an unusually large number of requests to our servers.
08:33 - Information about the incident is published on status.kundo.se
09:17 - Further resources for serving web requests are added.
09:18 - We start blocking requests from identified attackers.
09:39 - Performance has improved for the majority of users, but further resources for serving web requests are added.
10:19 - Further resources are added to the servers to stabilize them.
10:29 - All systems are confirmed to be fully available and Incident status is set to Resolved.
13:27 - We are informed about general unavailability of Kundo.
13:29 - Our incident process is initiated and troubleshooting is started.
13:37 - Information about the incident is published on status.kundo.se
13:39 - Resources for serving web requests are added.
13:49 - All systems are confirmed to be fully available and Incident status is set to Resolved.
14:57 - An alarm about reduced performance in our chat service is triggered.
14:57 - Our incident process is initiated and troubleshooting is started.
15:05 - A caching improvement for our chat is released.
15:06 - A caching configuration change is made in the web servers.
13:49 - All systems including the Chat are confirmed to be available and Incident status is set to Resolved.
Kundo was the target of a DDoS attack that flooded the web servers with requests, bringing the processing of requests to a near halt. The large number of requests made other parts of the system unstable. We already had several DDoS mitigations in place, as well as automatic scaling of server capacity, which normally handle this kind of situation automatically. In this case, those precautions proved inadequate, which resulted in periods of downtime.
To mitigate this problem, the capacity of the servers was increased and the malicious requests from the attackers were blocked. The increased capacity posed some difficulties for the system setup and required some manual intervention to succeed. The blocked requests still put some load on the system, which made it struggle to recover.
A series of caching improvements, increased server capacity and some manual handling finally made the system available again.
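For readers interested in how requests from an attacker can be identified and blocked in general, the following is a minimal sketch in Python. It is not a description of our actual setup: the class name `RequestRateTracker`, its parameters and the thresholds used are hypothetical and chosen only for illustration. It counts requests per client within a sliding time window and blocks clients that exceed a limit.

```python
from collections import defaultdict, deque


class RequestRateTracker:
    """Hypothetical helper: blocks clients that exceed a request limit
    within a sliding time window. Illustrative only."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # client id -> timestamps of that client's recent requests
        self._timestamps = defaultdict(deque)
        self.blocked = set()

    def allow(self, client_id: str, now: float) -> bool:
        """Record a request at time `now`; return False if it should be rejected."""
        if client_id in self.blocked:
            return False
        times = self._timestamps[client_id]
        # Drop timestamps that have fallen out of the sliding window.
        while times and now - times[0] >= self.window_seconds:
            times.popleft()
        times.append(now)
        if len(times) > self.max_requests:
            # Too many requests in the window: block this client from now on.
            self.blocked.add(client_id)
            del self._timestamps[client_id]
            return False
        return True


# Example: allow at most 3 requests per second per client.
tracker = RequestRateTracker(max_requests=3, window_seconds=1.0)
for t in (0.0, 0.1, 0.2, 0.3):
    print(t, tracker.allow("client-a", now=t))
```

Note that even in this sketch a blocked request must still be received and rejected, which is consistent with the observation above that blocked requests continued to put some load on the system.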
We have identified several actions that will mitigate the impact of failures like these in the future, among which are:
Several of these had already been implemented by the date of publication of this post mortem.
You are most welcome to contact us via email for more information: support@kundo.se