Friday, December 10, 2021

Software Resilience

 

This week's blog post is based on my work experience with AWS. On Tuesday, 7 Dec 2021, our products were affected by a multi-hour global AWS outage.

The first phase of the outage began at approximately 15:35 UTC (7:35 am PT), when multiple Amazon sites and services began to show significant performance degradation. While site loading appeared to mostly normalize by 16:50 UTC (8:50 am PT), we observed AWS API service failures that caused API transactions to take dramatically longer to complete or simply time out.

The second wave of the outage lasted for over 7 hours, not fully resolving until approximately 0:44 UTC (4:44 pm PT). We kept our products running by relying on our Disaster Recovery (DR) strategy.
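To make the idea concrete, here is a minimal sketch of one common building block of a DR setup: retry a call a few times with exponential backoff, then fail over to a secondary region. This is not our actual DR implementation. The function name `call_with_failover`, the exception `AllRegionsFailed`, and the choice of `us-west-2` as the secondary region are assumptions made only for illustration.

```python
import time
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


class AllRegionsFailed(Exception):
    """Raised when every region has used up its retries (hypothetical name)."""


def call_with_failover(
    regions: Sequence[str],
    operation: Callable[[str], T],
    attempts_per_region: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try `operation(region)` in each region in order.

    Within a region, retry on timeouts and connection errors with
    exponential backoff. Move on to the next region only after the
    current one has failed `attempts_per_region` times.
    """
    last_error: Optional[Exception] = None
    for region in regions:
        for attempt in range(attempts_per_region):
            try:
                return operation(region)
            except (TimeoutError, ConnectionError) as exc:
                last_error = exc
                # Back off before the next attempt in the same region;
                # no wait is needed right before failing over.
                if attempt < attempts_per_region - 1:
                    sleep(base_delay * (2 ** attempt))
    raise AllRegionsFailed(f"all regions failed: {list(regions)}") from last_error


# Usage: a stand-in operation that simulates the primary region timing out.
def fetch_status(region: str) -> str:
    if region == "us-east-1":
        raise TimeoutError("simulated API timeout")
    return f"ok from {region}"


result = call_with_failover(
    ["us-east-1", "us-west-2"],  # secondary region is an assumption
    fetch_status,
    base_delay=0.0,  # no real waiting in this demo
)
print(result)
```

In a real system the operation would be an actual service call with its own request timeout, and failover usually also involves data replication and DNS or routing changes, which this sketch leaves out.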

Amazon confirmed that service issues in AWS's main US-East-1 region, located in Northern Virginia, were causing problems for its warehouse and delivery network.

Reading the industry news coverage was an opportunity to learn how much influence AWS has on the industry.

AWS controlled 33% of the global cloud infrastructure market in the second quarter, according to Synergy Research Group, followed by Microsoft at 20% and Google at 10%.

As a result, the key industry observations were:

  • The outage brought down many popular websites and services.
  • Some of Amazon’s delivery operations ground to a halt, and third-party sellers couldn’t ship products.
  • Colleges that rely on software to host content had to postpone exams during finals week.

As a software engineer, my personal takeaway (recalling the golden words of my mentor Suresh): resilience is the ability to re-adapt intelligently to any crisis situation.

