What happened?
Over the last few weeks, a set of improvements has been made to the platform to absorb the high volume of traffic that stores generate during the Ramadan campaign period. A subset of these changes was made to Caddy, one of the critical components used to speed up how quickly the website is shown to end consumers. Caddy serves two main purposes: it filters and controls unwanted traffic to the platform, and it caches stores so that people browsing the same store are shown its pages faster (an acceleration of around 200% compared with not using this component).
At 9:43:57 pm on 4th April 2024, a scale-up of Caddy failed. As a result, some stores intermittently showed errors to end consumers, preventing them from continuing to browse and forcing them to refresh the page.
How long did it last?
6 minutes and 2 seconds
At 9:57:59 pm on 4th April 2024, all pods were working properly again and the issue was resolved.
Did it affect orders?
How are we preventing this from happening again?