Wednesday, October 19, 2016

Oops: Google Crashes Its Own Cloud (again)

When Salesforce's cloud crashed last May, customers lost hours of data. Permanently.
Still a few bugs in the system.

From the Register:

Google's crash canaries' muted chirping led to load balancer brownout
Google has revealed that it broke its own cloud again, this time because of two failures: a software error and alerts that proved too hard to interpret.

The problem hit Google's cloudy load balancers on Thursday, October 13, causing them to produce HTTP 502 (Bad Gateway) responses. At first, the error rate was just two per cent. But an hour and two minutes later, at 16:09 Pacific Time, 45 per cent of requests were failing, which made it rather hard to access virtual machines.

Google says its load balancers are “a global, geographically-distributed multi-tiered software stack which receives incoming HTTP(S) requests via many points in Google's global network, and dispatches them to appropriate Google Compute Engine instances” and that the problem started when “a configuration change was rolled out to one of these layers with widespread distribution beginning at 15:07.”

“This change triggered a software bug which decoupled second-tier load balancers from a number of first-tier load balancers. The affected first-tier load balancers therefore had no forwarding path for incoming requests and returned the HTTP 502 code to indicate this.”
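
To make that failure mode concrete, here is a minimal Python sketch; the names and structure are my own illustration, not Google's code. A first-tier balancer that is handed a configuration with no second-tier backends has no forwarding path left, so HTTP 502 is the only honest reply.

# Hypothetical sketch of the failure mode described above. The class and
# method names are illustrative, not Google's: a first-tier load balancer
# whose configuration has "decoupled" it from every second-tier balancer
# has nowhere to forward a request, so it answers HTTP 502 (Bad Gateway).

import random

class SecondTierLB:
    def forward(self, request):
        # Stand-in for dispatching to a Google Compute Engine instance.
        return 200, "OK"

class FirstTierLB:
    def __init__(self):
        self.second_tier = []          # normally populated by configuration

    def apply_config(self, config):
        # The buggy change effectively handed some first-tier balancers
        # an empty second-tier list.
        self.second_tier = list(config.get("second_tier", []))

    def handle(self, request):
        if not self.second_tier:
            # No forwarding path: fail loudly rather than hang.
            return 502, "Bad Gateway"
        return random.choice(self.second_tier).forward(request)

# With a second-tier backend configured, requests flow; with the
# "decoupled" configuration, every request comes back as a Bad Gateway.
lb = FirstTierLB()
lb.apply_config({"second_tier": [SecondTierLB()]})
assert lb.handle("GET /")[0] == 200
lb.apply_config({"second_tier": []})   # the decoupling bug
assert lb.handle("GET /")[0] == 502
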

Google says its networks incorporate protections “to prevent them from propagating incorrect or invalid configurations” but that these safeguards “were partially successful in this instance, limiting both the scope and the duration of the event, but not preventing it entirely.”

The Alphabet subsidiary's incident report says its first layer of protection is “a canary deployment, where the configuration is deployed at a single site and that site is verified to be functioning within normal bounds.”

But while “the canary step did generate a warning … it was not sufficiently precise to cause the on-call engineer to immediately halt the rollout. The new configuration subsequently rolled out in stages, but was halted part way through as further alerts indicated that it was not functioning correctly. By design, this progressive rollout limited the error rate experienced by customers.”
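
The canary-plus-progressive-rollout pattern Google describes can be sketched in a few lines of Python. Everything here, including the health_check signal, the error budget, and the stage sizes, is an assumption made for illustration; the idea is that a sufficiently precise canary signal halts the push at a single site, while a vague one lets the change spread until later alerts stop it partway through.

# Hypothetical sketch of canary-then-progressive rollout with halt-on-alert,
# as described in the incident write-up. Thresholds, stage sizes and the
# health_check signal are assumptions for illustration only.

def chunks(items, size):
    """Split a list of sites into consecutive rollout stages."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def rollout(new_config, sites, deploy, rollback, health_check,
            error_budget=0.01, stages=4):
    canary, rest = sites[0], sites[1:]

    # Step 1: canary deployment at a single site.
    deploy(canary, new_config)
    if health_check(canary) > error_budget:
        rollback(canary)
        return "halted at canary"

    # Step 2: progressive rollout; each stage widens the blast radius
    # only while everything already deployed still looks healthy.
    deployed = [canary]
    for stage in chunks(rest, max(1, len(rest) // stages)):
        for site in stage:
            deploy(site, new_config)
        deployed.extend(stage)
        if max(health_check(s) for s in deployed) > error_budget:
            for site in deployed:
                rollback(site)
            return "halted mid-rollout"
    return "fully rolled out"

# Toy demo: a canary reporting a 2% error rate against a 1% budget
# stops the push before it spreads beyond the single canary site.
state = {}
result = rollout(
    new_config="cfg-v2",
    sites=["site-canary", "site-a", "site-b", "site-c"],
    deploy=lambda s, c: state.__setitem__(s, c),
    rollback=lambda s: state.pop(s, None),
    health_check=lambda s: 0.02,
)
assert result == "halted at canary"

In the actual incident, the canary warning fired but wasn't crisp enough to trigger the halt at the first step, so the stop came partway through the staged rollout instead, which is exactly why Google's fix is more articulate canaries.
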

Google's fix for the problem is more articulate canaries....MORE
As noted a couple weeks ago:
"Econophysics: Or Why, When it Comes to Economics, We All Behave like Particles"
Where synchronization is going to get very interesting is when some critical mass of businesses migrates to cloud computing, say Amazon Web Services, and someone takes down AWS.
Unlike the good old days, when a computer problem put one company at risk, you'll have dozens, hundreds, or thousands of companies frozen, all their economic activity halted at the same time.
That's synchronization, baby!
See also last week's "Cloud Computing: One 'hiccup' and 'boom' - Amazon Web Services is 'gone'--Cisco President (AMZN)".