Partial and intermittent outage of stores lasting 6m 2s
Incident Report for Zid
Postmortem

What happened? 

Over the last few weeks, a set of improvements has been made to the platform to absorb the high volume of traffic generated by stores during the Ramadan campaign period. A subset of these changes was made to Caddy, one of the critical components used to speed up how quickly store websites are shown to end consumers. Caddy serves two main purposes: it filters unwanted traffic to the platform, and it caches store pages so that people browsing the same store are served pages faster (an acceleration of around 200% compared to not using this component).

At 9:43:57 pm on 4th April 2024, a Caddy scale-up failed. This caused some stores to intermittently show errors to end consumers, preventing them from continuing to browse and forcing them to refresh.

How long did it last?

6 minutes and 2 seconds

At 9:57:59 pm on 4th April 2024, all pods were working normally again and the issue was resolved.

Did it affect orders? 

  • In the 15 minutes after the issue, the rate of created orders was 10 to 15 orders per minute higher than the per-minute average before the issue occurred. This suggests that orders were not lost, given the traffic and order volume after the issue.
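The comparison above can be illustrated with a minimal sketch. It assumes per-minute order counts are available for a window before the issue and a window after it. The function name `order_rate_delta` and all numbers below are hypothetical and are not Zid's actual data.

```python
from statistics import mean


def order_rate_delta(before, after):
    """Compare average orders per minute before and after an incident.

    Returns (baseline, post, delta), all in orders per minute.
    """
    if not before or not after:
        raise ValueError("both windows need at least one per-minute count")
    baseline = mean(before)
    post = mean(after)
    return baseline, post, post - baseline


# Made-up per-minute order counts, for illustration only.
before_counts = [40, 42, 38, 40]
after_counts = [52, 55, 50, 51]

baseline, post, delta = order_rate_delta(before_counts, after_counts)
print(f"baseline={baseline:.1f}/min post={post:.1f}/min delta={delta:+.1f}/min")
```

A positive delta after the incident is consistent with customers returning to place orders once stores were reachable again, though it does not prove that every order was recovered.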

How are we preventing this from happening again?

  • We are proactively reviewing the Caddy configuration to ensure that larger issues like the one reported do not occur in the future, and that all known issues in the Caddy gateway are resolved.
  • We have added additional monitoring; however, this was a hardware failure that we are continuing to investigate with our cloud hosting supplier.
Posted Apr 07, 2024 - 13:37 GMT+03:00

Resolved
Some stores were not opening and were showing error messages to their customers from 9:43:57 pm to 9:57:59 pm. The issue made the platform unstable and was related to our gateway cluster, which is used to secure traffic and cache stores to ensure faster responses to end consumers.
Posted Apr 05, 2024 - 21:30 GMT+03:00