Incident Report - April 16

Chatwoot experienced an incident on April 16 that lasted from 1:15 PM to 3:24 PM UTC. A spike in requests degraded our services and caused some message events to be missed. The issue was not system-wide: it affected less than 1% of our active accounts, and in most of those accounts fewer than 10 conversations were affected.

Our sincerest apologies for the disruption. Reliability is the top priority for us at Chatwoot. We have identified the risks and taken steps to mitigate such events in the future.

Timeline

All times are in Coordinated Universal Time (UTC).

April 16, 2024

1:15 PM: The high throughput alert (P1) for Sidekiq began, and the team started investigating the incident.
1:21 PM: The queue was not draining quickly enough, so the incident was escalated to P0. Queue latency exceeded 5s.
1:40 PM: The team identified the account causing the issue, initially attributing it to high website traffic.
2:00 PM: More workers were added to process the queues faster, but new jobs continued to be added rapidly.
2:36 PM: The developer from the problematic account contacted us, revealing a bug in their system that caused a loop in the API calls.
3:08 PM: The account causing the issue was suspended.
3:24 PM: The system returned to stability.
4:15 PM: Some users reported missing messages in their conversations.
5:00 PM: The team discovered some jobs had gone missing during the incident.

April 17, 2024

The data lost during the event was recovered. Our team notified the affected accounts and described how the lost data would be shared with them, and the incident was resolved.

What happened

We observed an increase in message creation requests, resulting in a backlog of background jobs. Although we tried to address this by adding more workers to process the queue, the backlog kept growing.
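A rough way to see why extra workers alone did not help: the backlog only shrinks when total processing capacity exceeds the arrival rate. The sketch below is a simplified model with made-up rates chosen for illustration. None of the numbers come from the incident, and the function name and parameters are hypothetical.

```python
def backlog_over_time(arrival_rate, worker_count, jobs_per_worker, seconds,
                      initial_backlog=0):
    """Return the queue length after each second in a simple rate model.

    arrival_rate:    jobs enqueued per second (hypothetical value)
    worker_count:    number of workers draining the queue
    jobs_per_worker: jobs one worker finishes per second (hypothetical value)
    """
    capacity = worker_count * jobs_per_worker
    backlog = initial_backlog
    history = []
    for _ in range(seconds):
        available = backlog + arrival_rate
        processed = min(available, capacity)  # cannot process more than exists
        backlog = available - processed
        history.append(backlog)
    return history


# With arrivals at 100 jobs/s, 5 workers (50 jobs/s) fall behind,
# 8 workers (80 jobs/s) still fall behind, and 12 workers keep up.
print(backlog_over_time(100, 5, 10, 3))
print(backlog_over_time(100, 8, 10, 3))
print(backlog_over_time(100, 12, 10, 3))
```

In this model, adding workers slows the growth of the backlog but does not stop it while the source keeps producing jobs faster than the workers can drain them.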

We found that most of the background jobs were generated by a single account. We initially speculated that high website traffic and volume of conversations were to blame. However, we later identified a bug in the customer's system triggering the surge in message creation.

This backlog delayed message delivery, with latencies exceeding 5 seconds. After blocking requests from the account, we were able to restore the system to a stable state.

During our investigation into the incident, we discovered that some received messages had not been created in Chatwoot. We also received similar reports from customers. After examining the logs, we found a significant number of missing jobs. This was unexpected behavior, and the first time we had seen such an issue.

We queue background jobs using Sidekiq, which stores them in Redis. In our Redis configuration, the eviction policy was set to allkeys-lru. The surge in events caused Redis to run out of memory, and it began evicting keys according to this policy. As a result, we lost many jobs that had not yet been executed.
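The effect of an LRU eviction policy on pending work can be illustrated with a toy model. The sketch below is not how Redis or Sidekiq is implemented: real Redis limits memory in bytes and uses an approximate LRU algorithm, and the class name and `job:N` keys are hypothetical. It only shows the core idea that, once the store is full, the least recently used key is dropped even if it holds work that has not run yet.

```python
from collections import OrderedDict


class TinyLRUStore:
    """Toy key-value store that evicts the least recently used key when full.

    Assumption: capacity is counted in keys for simplicity. Real Redis caps
    memory in bytes and samples keys to approximate LRU.
    """

    def __init__(self, max_keys):
        self.max_keys = max_keys
        self.data = OrderedDict()  # oldest (least recently used) key first
        self.evicted = []          # record of keys dropped by eviction

    def set(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)  # writing counts as a recent use
        self.data[key] = value
        while len(self.data) > self.max_keys:
            old_key, _ = self.data.popitem(last=False)
            self.evicted.append(old_key)

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)  # reading counts as a recent use
        return self.data[key]


store = TinyLRUStore(max_keys=3)
for n in range(1, 5):
    # Hypothetical pending jobs; the fourth write pushes the store over its cap.
    store.set(f"job:{n}", f"create message {n}")

print(store.evicted)       # the oldest pending job was dropped
print(store.get("job:1"))  # the job is gone before any worker ran it
```

For comparison, Redis also provides a `noeviction` policy, under which writes are rejected once memory is full instead of existing keys being dropped. Whether that trade-off suits a given deployment is a separate decision.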

We were able to recover all of the missing messages from our logs. Our initial plan was to restore as many messages as possible to their original conversations. However, we discovered that new messages had already been added to most conversations, so importing the missing messages could confuse those managing the conversations.

We notified the customers impacted by the incident and offered them the option of an Excel export of the messages that were missed during the incident.

Next steps

The incident allowed us to identify the shortcomings in our systems. As a result, we will be updating our systems as follows.

Rate limiting for authenticated accounts: We already had rate limiting for unauthenticated users. We will now add stricter rate limits for authenticated users as well, including a limit on the number of messages a user can create within a specific time frame.
Moving to database-backed background jobs: We will work towards more persistent job queues for data-sensitive operations.
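As an illustration of the first item, a per-account limit on message creation within a time window could look roughly like the sketch below. This is a minimal in-memory sliding-window limiter written for explanation only. It is not Chatwoot's implementation, and the class name, parameters, and limits are all hypothetical.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` per account within `window_seconds`."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock                 # injectable so tests can fake time
        self.events = defaultdict(deque)   # account id -> recent timestamps

    def allow(self, account_id):
        now = self.clock()
        timestamps = self.events[account_id]
        # Discard timestamps that have fallen outside the window.
        while timestamps and now - timestamps[0] >= self.window_seconds:
            timestamps.popleft()
        if len(timestamps) >= self.max_requests:
            return False                   # over the limit: reject the request
        timestamps.append(now)
        return True


# Hypothetical limit: 3 message creations per account per 60 seconds.
limiter = SlidingWindowLimiter(max_requests=3, window_seconds=60)
results = [limiter.allow("account-42") for _ in range(5)]
print(results)  # the first three are allowed, the rest are rejected
```

A limiter shared across several application servers would need a common store for the counters rather than process memory. The in-memory version here only demonstrates the windowing logic.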