Thursday, February 1, 2018

From TechCrunch, Jan. 27:

Move slow and break nothing
Facebook Messenger was down for me for about an hour earlier this week. My MacBook Pro randomly kernel panics overnight and restarts. Slack was down, and GitHub, and AWS. A little more than a year ago, Dyn went down, throwing the DNS layer of the internet into a tailspin. Practically every chip made by Intel has serious security flaws. Equifax leaked 143 million accounts. Tokyo-based Coincheck lost over $400 million in tokens due to hackers.

If software is eating the world, then that might explain why everything seems so ridiculously broken these days.

It’s easy to blame companies, or hackers, or software engineers, and it’s just as easy to give up, believe that nothing is going to get better, and revert to a pre-agrarian society. What we have is a real crisis in reliability, not just across software, but across our entire society. Even the U.S. government had some serious downtime this week.

What’s going on is that we have greatly increased the complexity of our society’s systems, even as we couple them more tightly together. Charles Perrow, a sociology professor at Yale, described what the combination of these two produces as “normal accidents” in an eponymous book. It’s an oxymoronic term for a very intelligent observation: that what we think of as “accidents” or crashes or bugs are really quite common and indeed, inevitable, given the design of the systems we rely on.

Complex systems are ones in which changes, even small ones, can have disproportionate effects on the outcome. Take last year’s downtime of S3, Amazon’s storage layer. According to Amazon’s after-action report: “an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.”

Amazon has fixed the issue and put in new safeguards to make sure such a change can’t happen again. That’s fantastic, and Amazon should be lauded for writing up and disclosing a comprehensive report on the error. But this was a “normal accident” — the sheer complexity of Amazon’s services means that the surface area of things that can go wrong is practically infinite.
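
To make the idea of a safeguard concrete, here is a minimal sketch of a pre-flight check that refuses a removal request if it would leave too few servers running. It is not Amazon's tooling: the passage above does not describe how their safeguards work, and the function name plan_removal, the server names, and the min_capacity threshold are all hypothetical, chosen only for illustration.

```python
def plan_removal(active, requested, min_capacity):
    """Validate a request to remove servers from a fleet.

    Hypothetical guard, not any real operations tool: it rejects
    requests that name unknown servers or that would drop the fleet
    below an assumed minimum capacity.
    """
    active_set = set(active)
    requested_set = set(requested)

    # A mistyped input should fail loudly instead of being ignored.
    unknown = requested_set - active_set
    if unknown:
        raise ValueError(f"not active servers: {sorted(unknown)}")

    # Refuse any removal that leaves fewer servers than the floor.
    remaining = len(active_set) - len(requested_set)
    if remaining < min_capacity:
        raise ValueError(
            f"removal would leave {remaining} servers, "
            f"below the minimum of {min_capacity}"
        )

    return sorted(requested_set)


fleet = ["s1", "s2", "s3", "s4", "s5"]
print(plan_removal(fleet, ["s2"], min_capacity=3))
```

A check like this only narrows one failure path. It does nothing about the many others, which is the point of calling such accidents "normal."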

On top of complexity, tight coupling means that the separate parts of a system depend so closely on one another that a failure in one quickly spreads to the rest. When S3 went down, it knocked out a bunch of major websites, because websites had no backup or redundancy in the event that Amazon’s services were not working. That is, except for Netflix, which had developed redundancies in its infrastructure to ensure that the failure of any individual component would not bring down the entire system.
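
One way to picture that kind of redundancy is a read that tries a second source when the first is unavailable. The sketch below is only an illustration under stated assumptions: BackendDown, fetch_with_fallback, and the two stand-in backends are hypothetical names, and nothing here reflects how Netflix or any other company actually built its infrastructure.

```python
class BackendDown(Exception):
    """Raised by a (hypothetical) storage backend that is unavailable."""


def fetch_with_fallback(key, backends):
    """Try each backend in order and return the first successful result.

    `backends` is a list of callables taking a key. Only BackendDown is
    treated as a reason to move on; any other error propagates.
    """
    for backend in backends:
        try:
            return backend(key)
        except BackendDown:
            continue  # this component failed; try the next one
    raise BackendDown(f"all {len(backends)} backends failed for {key!r}")


# Stand-ins for a primary store that is down and a working replica.
def primary(key):
    raise BackendDown("primary region unavailable")


def replica(key):
    return f"contents of {key} (from replica)"


print(fetch_with_fallback("index.html", [primary, replica]))
```

The loosely coupled version costs more to build and run, since the replica has to exist and be kept current, which may be part of why so few sites had one.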

Everything about our modern world has increased both the complexity of our systems and how tightly they are coupled. Take software development itself. The (usually) clearly designed APIs and libraries of the host operating system have been replaced by a ghastly and constantly evolving collection of libraries and web frameworks, a palimpsest of code and hope...