Incident Report - 4 February 2026
Vishnu Narayanan
Chatwoot Cloud experienced an incident on February 4 that lasted approximately 12 minutes, from 12:48 PM to 1:00 PM UTC. During this time, all Chatwoot Cloud users were unable to access the platform. No data was lost during this incident.
Our sincerest apologies for the disruption. Reliability is the top priority for us at Chatwoot. We have identified the risks and have taken steps to mitigate such events in the future.
Timeline
All times are in Coordinated Universal Time (UTC)
February 4, 2026
- 12:43 PM: Database instability began, connections started failing
- 12:48 PM: Service disruption began, team started investigating
- 12:58 PM: Root cause identified as storage exhaustion; storage capacity increase initiated
- 1:00 PM: Storage scaling completed, service fully restored
What happened
We were preparing for a PostgreSQL version upgrade using AWS RDS blue-green deployments. A failed deployment was left in a pending state, which caused unexpected storage consumption over several days.
RDS blue-green deployments use logical replication internally. When the deployment failed and was not cleaned up, it retained a replication slot on the primary database. This prevented the database from cleaning up write-ahead log (WAL) files. Over 3 days, approximately 1 TB of WAL accumulated alongside our 1 TB of actual data, eventually hitting our 2 TB storage autoscaling ceiling.
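For anyone debugging a similar situation, the retained WAL is visible directly in PostgreSQL's `pg_replication_slots` view. Below is a minimal sketch (the connection string is a placeholder; psycopg2 is assumed to be available) that lists each slot and how much WAL it is pinning:

```python
# Sketch: list replication slots and the WAL each one pins on the primary.
# The DSN below is a placeholder, not a real connection string.
import psycopg2

QUERY = """
SELECT slot_name,
       slot_type,
       active,
       pg_size_pretty(
           pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
       ) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;
"""

with psycopg2.connect("postgresql://user:pass@primary-host:5432/postgres") as conn:
    with conn.cursor() as cur:
        cur.execute(QUERY)
        for slot_name, slot_type, active, retained_wal in cur.fetchall():
            # An inactive slot with a large retained_wal is the failure
            # mode described above: WAL piles up until storage runs out.
            print(f"{slot_name} ({slot_type}) active={active} retained={retained_wal}")
```

An inactive slot holding a large amount of WAL is exactly the failure mode described above; once such a slot is confirmed to be orphaned, `SELECT pg_drop_replication_slot('<slot_name>');` releases the retained WAL.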
When storage was exhausted, the database could no longer accept connections, causing the service disruption. Once we identified the issue, we immediately scaled our storage capacity and restored service.
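For illustration only, this kind of storage increase can be applied through the RDS API. The sketch below uses a hypothetical instance identifier and sizes, and is not our exact remediation command:

```python
# Sketch: raise the RDS allocated storage and the autoscaling ceiling.
# Identifier and sizes are illustrative; boto3 credentials are assumed.
import boto3

rds = boto3.client("rds")
rds.modify_db_instance(
    DBInstanceIdentifier="chatwoot-primary",  # hypothetical identifier
    AllocatedStorage=3000,                    # new allocated size, in GiB
    MaxAllocatedStorage=4000,                 # new autoscaling ceiling, in GiB
    ApplyImmediately=True,                    # apply now, not in the maintenance window
)
```

Raising `MaxAllocatedStorage` lifts the autoscaling ceiling that was hit during the incident, so the instance has headroom again even before the underlying WAL problem is cleaned up.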
Next steps
To prevent similar incidents, we are implementing the following changes:
- Proactive storage monitoring: We are adding alerts at multiple storage utilization thresholds (60%, 75%, 90%) to catch capacity issues before they become critical (a sketch follows this list).
- Replication slot monitoring: We are implementing monitoring for database replication slots to detect orphaned slots that could cause WAL accumulation.
- Database maintenance runbooks: We are creating detailed runbooks for database upgrade procedures with mandatory cleanup steps when deployments fail.
- Infrastructure capacity review: We are reviewing storage limits and autoscaling configurations across all production systems to ensure adequate headroom.
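As an example of the first item, the RDS `FreeStorageSpace` CloudWatch metric can be alarmed at each threshold. This is a sketch under stated assumptions: the instance identifier and SNS topic ARN are placeholders, and boto3 is assumed.

```python
# Sketch: CloudWatch alarms on RDS FreeStorageSpace at several thresholds.
# Instance identifier, topic ARN, and total size are illustrative.
import boto3

cloudwatch = boto3.client("cloudwatch")
TOTAL_BYTES = 2 * 1024**4  # 2 TB allocated storage, as in this incident

for used_pct in (60, 75, 90):
    # FreeStorageSpace is reported in bytes, so convert "X% used"
    # into a "free space below Y bytes" threshold.
    free_threshold = TOTAL_BYTES * (100 - used_pct) / 100
    cloudwatch.put_metric_alarm(
        AlarmName=f"rds-storage-{used_pct}pct-used",
        Namespace="AWS/RDS",
        MetricName="FreeStorageSpace",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "chatwoot-primary"}],
        Statistic="Average",
        Period=300,
        EvaluationPeriods=1,
        ComparisonOperator="LessThanThreshold",
        Threshold=free_threshold,
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # placeholder ARN
    )
```

Staggered thresholds give the team escalating warnings: the 60% alarm is informational, while 90% demands immediate action well before the database stops accepting connections.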