Performance issue
Incident Report for Salesdock
Postmortem

In the first two weeks of January, Salesdock faced a number of mostly short outages. Because there were more than a handful of them, we treat this as one major incident. Transparency and accountability are part of the Salesdock DNA, so we want to explain what went wrong, what we did, and what we will do to prevent this from happening again.

First of all, we are very sorry that this incident happened. We are focused on delivering a high-performing, stable platform, and at the beginning of January we did not meet our own expectations in terms of stability. We apologize for that.

The story of this incident starts on 25 October 2023, when we experienced 10 minutes of downtime. Based on our own research and that of our hosting provider Shock Media, there appeared to be a problem with our queueing system: inserting jobs into the queue would randomly and unexpectedly grind to a halt. Normally these inserts are extremely fast, since the queue by nature only holds records for a short amount of time.

Because these queries would not finish, the number of open database connections kept growing until it exceeded the allowed maximum, making the application unavailable.
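To make that failure mode concrete, here is a small, purely illustrative Python simulation (not Salesdock code, and deliberately simplified): every stalled insert keeps holding its database connection, so once inserts stop finishing, the connection pool fills up and new requests are turned away.

    import threading
    import time

    MAX_CONNECTIONS = 5                # stand-in for the database's max_connections setting
    pool = threading.Semaphore(MAX_CONNECTIONS)
    lock_released = threading.Event()  # simulates a long-lived lock on the jobs table

    def enqueue_job(job_id: int) -> None:
        # Each enqueue needs a database connection from the pool.
        if not pool.acquire(timeout=1):
            print(f"job {job_id}: rejected, max connections reached")
            return
        try:
            # A healthy insert returns in milliseconds; a blocked one waits here
            # for the lock to clear, holding its connection the whole time.
            lock_released.wait()
            print(f"job {job_id}: inserted")
        finally:
            pool.release()

    threads = [threading.Thread(target=enqueue_job, args=(i,)) for i in range(8)]
    for t in threads:
        t.start()

    time.sleep(2)        # the lock is held for two seconds...
    lock_released.set()  # ...then finally released
    for t in threads:
        t.join()

In this toy setup, five enqueue attempts get stuck holding a connection and the remaining three are rejected outright, which is roughly what our users experienced as "max connections" errors.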

Unfortunately, with the little information we had about the disruption, we could not tell why these problems were happening. It seemed random, and we could not explain it. Feedback from our hosting partner pointed to a possible problem inside the application, but the queue system is part of a sophisticated, battle-tested framework, and in the five years of Salesdock we had never seen a problem with it that looked like this one. We ended up restarting the database, which resolved the problem. We were up again, but we had no clue what had caused the issue.

Since we had never experienced this issue before and had no further information, there was not much we could do at that point. We implemented some extra logging, so that if it ever happened again we would have more information and context. Two months passed without any incidents, until 2 January 2024. The year had just started when Salesdock hit the same problem, with the same symptoms. We quickly contacted Shock Media, and while they did their research, we also started to dig deeper. Eventually we decided to restart the database server again, which resolved the problem.

After that incident we kept digging, as the problem had now happened twice. We were sure it would happen again, so we had to put more effort into finding the root cause. Our top engineers investigated the issue, but the root cause remained elusive. We made some improvements to our jobs system and implemented additional logging, and Shock Media made some configuration changes to the database, in the hope that this would resolve the issue. On 11 January we experienced the same problem again, and the frequency appeared to be increasing; the improvements to the application unfortunately had not resolved it. On 15 and 16 January we had a couple of short hiccups as well, caused by Shock Media needing to restart the database to prevent new 'max connections' problems for our users.

After more research, we increasingly got the impression that the number of jobs we were processing was hitting new highs, and that this was triggering all kinds of seemingly random problems, mostly database lock issues. We concluded that a database was no longer the right backend for the queue of a high-concurrency platform like Salesdock, where we process a couple of thousand incoming requests every minute. The strong adoption of our workflow automation module had also accelerated the growth in the number of jobs we process. Luckily, we were already working on a replacement for the database-backed queue, but because of other priorities the project had not been finished. We did not expect the existing setup to cause any issues, but we were clearly wrong in that judgement.

On Tuesday evening, 16 January, we decided to swap the queue database for Redis as the storage backend for our jobs. We finished the project in a couple of hours, did thorough testing, and switched over to Redis at 22:00 CET. Since a large part of the NL team was in India at the time, the release happened at 2:30 AM local time there. We stayed awake that night because we had a night flight back to Düsseldorf, and during the flight we kept in contact with our team in India, eager to find out whether our solution had resolved the problem once and for all. Besides this big change, we also implemented some other improvements that prevent unnecessary jobs from being spawned. As our customers woke up and traffic peaked, we did not see any new incidents. The entire day stayed calm and incident free, and the day after everything remained perfectly stable.
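For readers curious what such a swap looks like in principle, below is a minimal Python sketch of a Redis-backed job queue. It is not our production code (in practice our framework's queue driver handles this for us); the local connection details, the queue name "jobs", and the handle() dispatcher are assumptions for illustration only.

    import json
    import redis  # third-party "redis" package; assumes a local Redis server

    r = redis.Redis(host="localhost", port=6379, db=0)

    def enqueue(job_type: str, payload: dict) -> None:
        # LPUSH appends in O(1) and never waits on a relational lock, so
        # enqueueing stays fast no matter how many workers are reading.
        r.lpush("jobs", json.dumps({"type": job_type, "payload": payload}))

    def handle(job: dict) -> None:
        # Hypothetical dispatcher; real workers would route to the right handler.
        print(f"processing {job['type']} job")

    def work_forever() -> None:
        while True:
            # BRPOP blocks until a job arrives, so idle workers consume no CPU.
            _, raw = r.brpop("jobs")
            handle(json.loads(raw))

    if __name__ == "__main__":
        enqueue("send_email", {"to": "customer@example.com"})

Because Redis keeps the queue in memory and pushes and pops are constant-time operations, enqueueing jobs no longer competes with the rest of the application for database connections or locks.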

The main lesson we learned is to never use a relational database as the backend of a queue system on a high-concurrency production platform like ours. If you have any questions about this incident, don't hesitate to reach out to us!

Posted Jan 23, 2024 - 12:40 CET

Resolved
We have implemented a fix and will keep a close eye on everything.
Posted Jan 17, 2024 - 05:05 CET
Update
Over the past few days we unfortunately experienced problems that forced us to restart our database a few times (resulting in short downtime). These problems of course have our full attention, and yesterday and today we worked hard on finding their cause. Tonight at 22:00 we will therefore carry out additional maintenance, which we expect to resolve the problems. Because of this maintenance we may be offline for up to 15 minutes.
Posted Jan 16, 2024 - 19:52 CET
Monitoring
We are experiencing degraded performance and are monitoring the issue.
Posted Jan 16, 2024 - 10:55 CET
This incident affected: API, SMS, Agent portaal, Admin portaal, E-mails, Contract generatie, Integraties, Outside, and Exports.