AWS Outage: Lessons Learned

Recently, we experienced a significant AWS outage that affected many businesses for several hours. It served as a reminder of the importance of having a robust disaster recovery plan in place. I’m curious to hear how others are preparing for potential outages and what tools you’re using to improve system resiliency.


I totally get the need for a solid disaster recovery plan. Last year during an outage, we started using AWS Lambda for automated failovers, which really helped us minimize downtime. It’s not foolproof though; always keep an eye on your alerting systems, or you might miss critical events during those hours of chaos.
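For anyone wondering what a Lambda-driven failover might look like, here is a minimal sketch in Python. It is not our actual setup, just an illustration under assumptions: the event fields (`health_results`, `standby_region`, `threshold`), the `switch_traffic` and `notify` callables, and the default region are all hypothetical names. In a real deployment `switch_traffic` would wrap whatever DNS or load balancer change you use, and `notify` would hook into your alerting.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class FailoverDecision:
    should_fail_over: bool
    consecutive_failures: int


def decide_failover(health_results: Sequence[bool], threshold: int = 3) -> FailoverDecision:
    """Count the most recent consecutive failed checks and compare to the threshold.

    health_results is ordered oldest to newest; True means the check passed.
    """
    failures = 0
    for healthy in reversed(health_results):
        if healthy:
            break
        failures += 1
    return FailoverDecision(failures >= threshold, failures)


def handler(
    event: dict,
    context: object = None,
    switch_traffic: Callable[[str], None] = lambda region: None,  # hypothetical hook
    notify: Callable[[str], None] = print,  # hypothetical alerting hook
) -> dict:
    """Lambda-style entry point: fail over only after repeated failed checks."""
    results = event.get("health_results", [])
    standby = event.get("standby_region", "us-west-2")  # assumed default region
    decision = decide_failover(results, event.get("threshold", 3))

    if decision.should_fail_over:
        switch_traffic(standby)
        # Always emit an alert so the failover itself is not a silent event.
        notify(
            f"Failing over to {standby} after "
            f"{decision.consecutive_failures} consecutive failed checks"
        )

    return {
        "failed_over": decision.should_fail_over,
        "consecutive_failures": decision.consecutive_failures,
    }
```

Requiring several consecutive failures before switching is a deliberate choice to avoid flapping on a single bad check, and the `notify` call is there because of the alerting point above: an automated failover you never hear about is its own kind of incident.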
