There's a lot of wild guessing and chatter about the AWS outage, often to suit a narrative.
Where I differ from the resilience narratives is that I don't believe the software system itself is complex. I think AWS suffered from a complicated problem, not a complex one - hence the quick resolution and root cause analysis. The complexity lies in the human systems that place value on different things.
It is possible to build intricate software with incredibly low rates of failure - in the medical, nuclear and aviation sectors this is routine. But it has an attached cost. Even then, the complexity of human systems means that the human system can shift to a new state where the previously reliable software suddenly becomes dangerous. Vigilance is always necessary and outages are always a few steps away.
AWS isn't offering extremely low failure rates, it's offering very low failure rates. It would make the product too expensive otherwise. The whole point of cloud is to be cheap at scale. AWS has millions of customers, all with different needs around reliability. This requires a very high level of flexibility in the architecture. Making it more reliable means constraining the system - but cloud customers are buying the flexibility that this constraint would kill. The balance between flexibility and reliability is called criticality. Cloud providers are aiming for criticality across millions of customers - which is very different from perfection. As these cloud platforms move toward ever greater reliability, they will do so at the cost of calcification - see mainframe systems for an example of how that looks. So the onus falls on us to build systems that can cope with failures, both to preserve the criticality and low cost of the underlying cloud infrastructure and to benefit from the economies of scale.
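To make "cope with failures" a bit more concrete, here is a minimal sketch of one common pattern: retry a call with exponential backoff, then fail over to an alternative. Everything in it is an assumption for illustration - `call_with_failover`, `AllEndpointsFailed` and the parameters are hypothetical names, not an AWS API, and the endpoints are just callables you would supply yourself (say, one per region).

```python
import time
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


class AllEndpointsFailed(Exception):
    """Raised when every endpoint has exhausted its retries."""


def call_with_failover(
    endpoints: Sequence[Callable[[], T]],
    retries_per_endpoint: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try each endpoint in order, retrying with exponential backoff.

    `endpoints` is a hypothetical list of zero-argument callables,
    ordered by preference (for example primary first, then secondary).
    """
    errors: List[Exception] = []
    for call in endpoints:
        for attempt in range(retries_per_endpoint):
            try:
                return call()
            except Exception as exc:  # a real system would catch narrower errors
                errors.append(exc)
                # Back off before retrying the same endpoint; after the
                # last attempt, move straight on to the next endpoint.
                if attempt < retries_per_endpoint - 1:
                    sleep(base_delay * 2 ** attempt)
    raise AllEndpointsFailed(errors)
```

You would call it as `call_with_failover([fetch_from_primary, fetch_from_secondary])`, where both functions are your own. It is deliberately simple: production code would usually add jitter, timeouts and a limit on total time spent, and whether failing over is safe at all depends on your data and consistency needs.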
So the complicated problems that start to build up in these kinds of systems are priced in. There isn't time or money to identify and work through every single issue - so we live with a little risk and invest in incident response to put the fires out. The right investment level will suffer a few outages but not too many, and have measures in place to try to suffer a particular outage only once.
Of course, when customers are upset you can't tell them this.
So to all the people mocking AWS - their system is doing exactly what it's meant to be doing. This will happen again, to all cloud providers, and if you think you're better than them, try to replicate what they've done - you'll find it extremely difficult.
Reliability and resilience are the job of the software architect in each individual project. It's possible to build this on top of platforms that aren't completely resilient out of the box. This is also why you can't import FAANG narratives about resilience into your project: they are two different worlds. It's really not fair to point fingers and mock AWS; that comes from a misplaced expectation.