
Google broke its own cloud by doing two updates at once

Right hand, meet left hand. Then both do a facepalm

Google has explained an August 11th brownout on its cloud as, yet again, a self-inflicted wound.

At the time of the incident, Google said only that App Engine APIs were temporarily unavailable.

It's now saying the almost-two-hour incident meant “18% of applications hosted in the US-CENTRAL region experienced error rates between 10% and 50%, and 3% of applications experienced error rates in excess of 50%. Additionally, 14% experienced error rates between 1% and 10%, and 2% experienced error rate below 1% but above baseline levels.”

Users also wore “a median latency increase of just under 0.8 seconds per request.”

Google's now revealed the root cause of the accident, which started with “... a periodic maintenance procedure in which Google engineers move App Engine applications between datacenters in US-CENTRAL in order to balance traffic more evenly.”

When Google does this sort of thing “... we first move a proportion of apps to a new datacenter in which capacity has already been provisioned. We then gracefully drain traffic from an equivalent proportion of servers in the downsized datacenter in order to reclaim resources. The applications running on the drained servers are automatically rescheduled onto different servers.”

All of which sounds entirely sensible.
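For the non-SREs in the audience, that routine amounts to something like the toy Python sketch below. It is only an illustration of the shape Google describes: the names (rebalance, servers, apps, target_capacity) are invented for the example, not lifted from App Engine's actual tooling.

    # Toy sketch of the rebalancing pattern Google describes; all names
    # here are invented for illustration, not real App Engine internals.
    import random

    def rebalance(apps, servers, target_capacity, proportion=0.1):
        # 1. Move a slice of apps to the new datacenter, where capacity
        #    has already been provisioned.
        to_move = random.sample(apps, int(len(apps) * proportion))
        assert len(to_move) <= target_capacity, "provision capacity first"

        # 2. Gracefully drain an equivalent proportion of servers in the
        #    downsized datacenter to reclaim their resources.
        to_drain = servers[: int(len(servers) * proportion)]

        # 3. Apps still running on the drained servers get rescheduled
        #    onto other servers.
        rescheduled = [a for s in to_drain for a in s["apps"] if a not in to_move]
        return to_move, to_drain, rescheduled

    if __name__ == "__main__":
        apps = [f"app-{i}" for i in range(100)]
        servers = [{"name": f"srv-{i}", "apps": apps[i * 10:(i + 1) * 10]}
                   for i in range(10)]
        moved, drained, rescheduled = rebalance(apps, servers, target_capacity=20)
        print(f"{len(moved)} apps moved, {len(drained)} servers drained, "
              f"{len(rescheduled)} apps rescheduled")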

But while Google was draining the pool on this occasion, “a software update on the traffic routers was also in progress, and this update triggered a rolling restart of the traffic routers. This temporarily diminished the available router capacity.”

“The server drain resulted in rescheduling of multiple instances of manually-scaled applications. App Engine creates new instances of manually-scaled applications by sending a startup request via the traffic routers to the server hosting the new instance.”

Some of those manually-scaled instances started up slowly “resulting in the App Engine system retrying the start requests multiple times which caused a spike in CPU load on the traffic routers. The overloaded traffic routers dropped some incoming requests.”

Google says it had enough routing capacity to handle the load, but that the routers weren't expecting all those retry requests. And so its cloud browned out.
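The textbook defence against that sort of retry pile-up is capped exponential backoff with jitter, so slow starters don't all hammer the routers again in lockstep. The Python below is purely illustrative (it is not App Engine's retry code, and start_instance_with_backoff is an invented name), but it shows the general idea.

    # Minimal sketch of capped exponential backoff with jitter; this is
    # illustrative only, not App Engine's actual retry logic.
    import random
    import time

    def start_instance_with_backoff(send_start_request, max_attempts=5,
                                    base_delay=0.5, max_delay=30.0):
        # Retry a startup request, backing off exponentially with jitter so
        # slow instances don't turn into a synchronised retry spike.
        for attempt in range(max_attempts):
            if send_start_request():  # e.g. an RPC sent via the traffic routers
                return True
            # "Full jitter": sleep a random amount up to the capped bound.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
        return False

    if __name__ == "__main__":
        # Simulate a slow-starting instance that succeeds on the third try.
        attempts = {"n": 0}

        def flaky_start():
            attempts["n"] += 1
            return attempts["n"] >= 3

        print("started:", start_instance_with_backoff(flaky_start))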

The Alphabet subsidiary managed to roll back and restore services, and now promises that “In order to prevent a recurrence of this type of incident, we have added more traffic routing capacity in order to create more capacity buffer when draining servers in this region.”

“We will also change how applications are rescheduled so that the traffic routers are not called, and also modify the system's retry behavior so that it cannot trigger this type of failure.”

No mention, however, of trying to schedule upgrades so it is only doing one at a time. ®
