On June 18, 2024, approximately 500 users called to report that iygeal.com was inaccessible. Our monitoring system raised an alert at almost the same time. The incident lasted about 2 hours before we were able to resolve it.
Issue Summary
Duration of the Outage: June 18, 2024, 14:00 WAT to 16:00 WAT (2 hours)
Impact:
- The e-commerce website was completely inaccessible.
- Users experienced 404 errors when trying to access any page.
- 100% of active users were affected.
Root Cause: A file in the deployment package was misnamed (class-wp-locale.phpp instead of class-wp-locale.php), so the application could not find it and failed during initialization.
Timeline
- 14:00 WAT: Automated monitoring detected the issue and alerted on a sudden spike in 404 errors.
- 14:05 WAT: The incident response team, headed by Loay, was notified via Slack.
- 14:10 WAT: The on-call engineer began the initial investigation, focusing on potential web server misconfigurations.
- 14:15 WAT: Engineer Innocent used tmux to create two terminal instances for simultaneous debugging.
- 14:20 WAT: Ran curl -sI 127.0.0.1 in one terminal to test the local server response, which showed a 500 Internal Server Error.
- 14:25 WAT: Attached strace to the Apache process with sudo strace -p <apache_pid> in the second terminal to trace system calls and signals.
- 14:30 WAT: strace output revealed an attempt to open /var/www/html/wp-includes/class-wp-locale.phpp, which failed with ENOENT (No such file or directory).
- 14:40 WAT: False lead: the team assumed the issue was due to a misconfigured database connection.
- 15:00 WAT: Escalated to the DevOps team to verify the deployment process.
- 15:10 WAT: The DevOps team confirmed the typo in the filename (.phpp instead of .php).
- 15:20 WAT: Developed a Puppet script to automate the correction of .phpp to .php in the deployment files.
- 15:30 WAT: Deployed the Puppet script, which scanned the affected directory and corrected the file extension.
- 15:40 WAT: Retried the deployment with the corrected files.
- 16:00 WAT: A 200 status code indicated that the system was fully operational, and users confirmed the site was accessible.
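The 14:30 finding came from reading strace output by eye. As a rough illustration of how that step could be sped up, here is a minimal Python sketch that pulls the paths of failed file lookups out of strace lines. The function name missing_paths and the SAMPLE lines are our own for illustration; the sample is not the actual capture from the incident, and real strace output may carry prefixes (timestamps, for example) that this pattern does not handle.

```python
import re

# Matches failed file lookups in strace output, for example:
#   open("/some/path", O_RDONLY) = -1 ENOENT (No such file or directory)
# An optional "[pid N]" prefix (seen when following child processes) is allowed.
ENOENT_RE = re.compile(
    r'^(?:\[pid\s+\d+\]\s+)?'
    r'(?:open|openat|stat|lstat|access)\(.*?"([^"]+)".*\)\s+=\s+-1 ENOENT'
)


def missing_paths(strace_lines):
    """Return the paths strace reported as ENOENT, in first-seen order."""
    seen = []
    for line in strace_lines:
        match = ENOENT_RE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


# Illustrative lines only, not the real log from the outage.
SAMPLE = [
    'open("/var/www/html/wp-includes/class-wp-locale.phpp", O_RDONLY) = -1 ENOENT (No such file or directory)',
    'openat(AT_FDCWD, "/etc/hosts", O_RDONLY|O_CLOEXEC) = 3',
    'lstat("/var/www/html/wp-includes/class-wp-locale.phpp", 0x7ffc3a2b1c10) = -1 ENOENT (No such file or directory)',
]

if __name__ == "__main__":
    for path in missing_paths(SAMPLE):
        print(path)
```

Filtering for ENOENT first would likely have put the misnamed file in front of the engineer before the database connection was suspected.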
Detailed Root Cause and Resolution
Root Cause
The root cause of the outage was a typo in the deployment package: a critical file was named class-wp-locale.phpp instead of class-wp-locale.php. This incorrect filename caused the application to fail during initialization, leading to 404 errors across the site.
Resolution
The issue was identified using strace to trace system calls and detect the incorrect filename. A Puppet script was then developed and deployed to automate the correction of the typo in the deployment files. Once the deployment was retried with the corrected files, the application started successfully, and the site became accessible to users.
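The Puppet script itself is not reproduced in this post. To show the idea behind the fix, the following is a minimal Python sketch (not the Puppet script we ran) that renames every .phpp file under a directory to .php. The function name fix_phpp_extensions and the dry_run flag are hypothetical choices for this example.

```python
from pathlib import Path


def fix_phpp_extensions(root, dry_run=True):
    """Rename *.phpp files under root to *.php and return the (old, new) pairs.

    With dry_run=True (the default) nothing is changed on disk; the
    function only reports what it would rename.
    """
    renamed = []
    # Collect the matches first so renaming does not disturb the directory walk.
    for old in sorted(Path(root).rglob("*.phpp")):
        if not old.is_file():
            continue
        new = old.with_suffix(".php")
        if new.exists():
            # Never overwrite an existing file; leave it for manual review.
            continue
        if not dry_run:
            old.rename(new)
        renamed.append((old, new))
    return renamed
```

Running it once with dry_run=True and reviewing the returned pairs before running it again with dry_run=False keeps the correction from touching files it should not.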
Corrective and Preventative Measures
Improvements:
- Implement automated checks to verify the completeness and correctness of deployment packages before deployment.
- Enhance monitoring to include checks for critical configuration files and common file naming conventions.
- Update the deployment process to include a pre-deployment verification step.
- Improve incident response procedures to use debugging tools like strace, tmux, and curl more effectively.
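A pre-deployment verification step like the one described above could look roughly like the Python sketch below. It checks that required files are present and flags extensions that look like typos. REQUIRED_FILES and SUSPECT_SUFFIXES are assumptions made up for this example, not our real manifest, and verify_package is a hypothetical name.

```python
from pathlib import Path

# Hypothetical values for illustration; a real check would load these
# from the project's own manifest.
REQUIRED_FILES = ["wp-includes/class-wp-locale.php"]
SUSPECT_SUFFIXES = {".phpp", ".pph", ".phhp"}


def verify_package(root, required=REQUIRED_FILES, suspect=SUSPECT_SUFFIXES):
    """Return a list of human-readable problems found in the package at root."""
    root = Path(root)
    problems = []
    # 1. Every required file must exist at its expected relative path.
    for rel in required:
        if not (root / rel).is_file():
            problems.append(f"missing required file: {rel}")
    # 2. No file may carry an extension that looks like a typo of .php.
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in suspect:
            problems.append(
                f"suspicious extension: {path.relative_to(root).as_posix()}"
            )
    return problems
```

A deployment script could call verify_package on the unpacked package and abort when the returned list is not empty. In this incident both checks would have fired: the .php file was missing and a .phpp file was present.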
To-Dos:
1. Patch Deployment Script: Update the deployment script to include a verification step for critical files and naming conventions.
2. Add Monitoring: Implement file existence monitoring for critical configuration files and extensions.
3. Review Deployment Checklist: Revise the deployment checklist to ensure all necessary files are correctly named and included.
4. Conduct Training: Train the deployment team on the updated process, including the use of debugging tools like strace, tmux, and curl.
5. Post-Mortem Review: Schedule a review meeting to discuss the incident and the new measures with the entire engineering team.
By taking these measures, we aim to prevent similar issues from occurring in the future and improve our overall deployment and monitoring processes.