Saturday 22 June 2024

Postmortem of Debugging a Web Outage of iygeal.com, an E-commerce Service/Website

On 18 June 2024, we received a barrage of calls from approximately 500 users complaining that iygeal.com was inaccessible. Our monitoring system raised an alert at almost the same time. The incident lasted about 2 hours before we were able to resolve it.

Issue Summary

Duration of the Outage: June 18, 2024, 14:00 WAT to 16:00 WAT (2 hours)

Impact:

  • The e-commerce website was completely inaccessible.
  • Users experienced 404 errors when trying to access any page.
  • 100% of active users were affected.

Root Cause: A misnamed file in the deployment package (class-wp-locale.phpp instead of class-wp-locale.php) caused the application to fail during initialization.

Timeline

  • 14:00 WAT: Issue detected when the automated monitoring system alerted on a sudden spike in 404 errors.
  • 14:05 WAT: Incident response team, headed by Loay, was notified via Slack.
  • 14:10 WAT: Initial investigation by on-call engineer focused on potential web server misconfigurations.
  • 14:15 WAT: Engineer Innocent used tmux to create two terminal instances for simultaneous debugging.
  • 14:20 WAT: Ran curl -sI 127.0.0.1 on one terminal to test the local server response, which showed a 500 Internal Server Error.
  • 14:25 WAT: Attached strace to the Apache process using sudo strace -p <apache_pid> on the second terminal to trace system calls and signals.
  • 14:30 WAT: strace output revealed an attempt to open /var/www/html/wp-includes/class-wp-locale.phpp, which resulted in an ENOENT (No such file or directory) error.
  • 14:40 WAT: Misleading path: Assumed the issue was due to a misconfigured database connection.
  • 15:00 WAT: Escalated to the DevOps team to verify the deployment process.
  • 15:10 WAT: DevOps team confirmed the typo in the filename (.phpp instead of .php).
  • 15:20 WAT: Developed a Puppet script to automate the correction of .phpp to .php in the deployment files.
  • 15:30 WAT: Deployed the Puppet script, which scanned the affected directory and corrected the file extension.
  • 15:40 WAT: Retried the deployment with the corrected files.
  • 16:00 WAT: A 200 status code indicated that the system was fully operational, and users confirmed that the site was accessible.
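
The curl -sI check at 14:20 and the final 200 check at 16:00 could be scripted as a post-deployment smoke test. The following is a minimal Python sketch, not the tooling used during the incident. The function names head_status and classify are hypothetical, and the host and port are assumptions that would need to match the real server.

```python
import http.client


def head_status(host: str, port: int = 80, path: str = "/", timeout: float = 5.0) -> int:
    """Send a HEAD request (similar to `curl -sI`) and return the status code."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("HEAD", path)
        return conn.getresponse().status
    finally:
        conn.close()


def classify(status: int) -> str:
    """Map a status code to a coarse label for alerting."""
    if 200 <= status < 400:
        return "ok"
    if status == 404:
        return "not found"
    if 400 <= status < 500:
        return "client error"
    if status >= 500:
        return "server error"
    return "unexpected"
```

A deployment job could call classify(head_status("127.0.0.1")) right after a release and fail the rollout on anything other than "ok".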

Detailed Root Cause and Resolution

Root Cause

The root cause of the outage was a typo in the deployment package where a critical file was named class-wp-locale.phpp instead of class-wp-locale.php. This incorrect filename caused the application to fail during initialization, leading to 404 errors across the site.

Resolution

The issue was identified using strace to trace system calls and detect the incorrect filename. A Puppet script was then developed and deployed to automate the correction of the typo in the deployment files. Once the deployment was retried with the corrected files, the application started successfully, and the site became accessible to users.
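
The Puppet script itself is not reproduced in this postmortem. Purely as an illustration, the same correction can be sketched in Python. The function name fix_phpp_extensions and the dry_run flag are assumptions made for this example, and the sketch deliberately refuses to overwrite an existing .php file.

```python
from __future__ import annotations

from pathlib import Path


def fix_phpp_extensions(root: Path, dry_run: bool = True) -> list[tuple[Path, Path]]:
    """Find *.phpp files under root and rename them to *.php.

    Returns the (old, new) pairs that were (or, in dry-run mode, would be) renamed.
    """
    renames = []
    for old in sorted(root.rglob("*.phpp")):
        if not old.is_file():
            continue
        new = old.with_suffix(".php")
        if new.exists():
            # Never overwrite an existing file; leave it for a human to inspect.
            continue
        renames.append((old, new))
        if not dry_run:
            old.rename(new)
    return renames
```

Running it first with dry_run=True lists the planned renames so they can be reviewed before anything on disk changes.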

Corrective and Preventative Measures

Improvements:

  • Implement automated checks to verify the completeness and correctness of deployment packages before deployment.
  • Enhance monitoring to include checks for critical configuration files and common file naming conventions.
  • Update the deployment process to include a pre-deployment verification step.
  • Improve incident response procedures to utilize debugging tools like strace, tmux, and curl more effectively.
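
A pre-deployment verification step of the kind described above might look like the following Python sketch. The list of required files and the set of allowed extensions are hypothetical placeholders, not our actual manifest, and verify_package is a name invented for this example.

```python
from __future__ import annotations

from pathlib import Path

# Hypothetical manifest: files the application cannot start without.
REQUIRED_FILES = ["wp-includes/class-wp-locale.php"]

# Hypothetical allow-list of extensions expected in the package.
ALLOWED_SUFFIXES = {".php", ".css", ".js", ".json", ".png", ".jpg", ".txt", ".html"}


def verify_package(root: Path, required=REQUIRED_FILES, allowed=ALLOWED_SUFFIXES) -> list[str]:
    """Return a list of problems found in the package; an empty list means it passed."""
    problems = []
    # Completeness: every required file must exist.
    for rel in required:
        if not (root / rel).is_file():
            problems.append(f"missing required file: {rel}")
    # Naming: flag any file whose extension is not on the allow-list.
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix and path.suffix not in allowed:
            problems.append(f"unexpected extension: {path.relative_to(root).as_posix()}")
    return problems
```

Against a package containing the misnamed file from this incident, the check would report both the missing .php file and the unexpected .phpp extension, so the deployment could be stopped before release.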

To-Do's:

1. Patch Deployment Script: Update the deployment script to include a verification step for critical files and naming conventions.

2. Add Monitoring: Implement file existence monitoring for critical configuration files and extensions.

3. Review Deployment Checklist: Revise the deployment checklist to ensure all necessary files are correctly named and included.

4. Conduct Training: Train the deployment team on the updated process, including the use of debugging tools like strace, tmux, and curl.

5. Post-Mortem Review: Schedule a review meeting to discuss the incident and the new measures with the entire engineering team.

By taking these measures, we aim to prevent similar issues from occurring in the future and improve our overall deployment and monitoring processes.