Verizon’s plans to shutter its public cloud on April 12, 2016, are making headlines, but it’s not the first time the company’s cloud service has been in the news. The post below, first published in January 2015, discusses a mysterious service outage that took place then.
With the closing of Verizon’s public cloud (private services are still available) and HPE’s Helion in 2015, the market seems to be shaping favorably for the big players — Amazon Web Services, of course, being the top dog. This chart, first posted in our end-of-year reflection on cloud services, shows the dramatic lead AWS has in the market.
(Editor’s note: This post first published January 21, 2015.)
You may recall that I recently wrote about Verizon taking its cloud down. Well, it’s back now, but we still don’t know why it was down for about 40 hours.
This really is weird with a capital W. To its credit, I guess, the Verizon Cloud was down for “only” 40 hours. This was for planned maintenance to provide “seamless upgrade functionality as well as other customer-facing updates.”
I’m not sure what that means.
Verizon “explained” that this would enable the company to “conduct major system upgrades without interrupting service or limiting infrastructure capacity. Traditionally, updates have been made via rolling maintenance and other methods. Many cloud vendors require customers to set up virtual machines [VM] in multiple zones or upgrade domains, which can increase the cost and complexity. Additionally, those customers must reboot their virtual machines after maintenance has occurred.”
Well, yes, but the last I checked none of the other cloud providers have had to take their services down for almost two days to do any of those things.
In addition, the last I checked, Verizon is still using a Linux/Xen-based platform for its cloud, just like Amazon Web Services (AWS), and, gosh, while AWS has certainly failed from time to time, it has never had to take down its entire system for 40+ hours.
Indeed, that’s the whole point of a cloud – that you can do rolling upgrades so that the entire system never need go down for maintenance. True, the latest version of Xen, Xen 4.5, includes Coarse-grained Lock-stepping (COLO). With COLO you can replicate the state of a primary VM (PVM) on demand to a secondary VM (SVM) on a different physical system. In short, with this you can provide non-stop VM services by enabling near-instantaneous local and remote recovery from a failed VM.
But, ahem, Xen 4.5 was released after the Verizon Cloud upgrade. And even if Verizon took its cloud technology life into its own hands by adopting a Xen 4.5 release candidate, COLO is still a work in progress. It’s not ready for prime time. In short, there will still be times when you’ll need to reboot your VM after an upgrade.
Besides, while there’s no doubt it’s more expensive to maintain VMs in separate zones, that’s the price you have to pay for reliability. Even the best single cloud data-center in the world can still be knocked out by a natural disaster.
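For readers who want to see what "rolling" means in practice, here is a minimal sketch in Python. It is a toy model, not Verizon's, AWS's, or Xen's actual tooling: the `Host` class, the `rolling_upgrade` function, and the zone and version labels are all hypothetical names made up for illustration. The idea is simply that hosts are drained, upgraded, and returned to service one small batch at a time, so most of the fleet keeps serving throughout.

```python
from dataclasses import dataclass


@dataclass
class Host:
    """Hypothetical stand-in for one physical host in a cloud fleet."""
    name: str
    zone: str
    version: str
    in_service: bool = True


def rolling_upgrade(hosts, new_version, batch_size=1):
    """Upgrade hosts one batch at a time.

    Returns the lowest number of hosts that were in service at any
    point during the upgrade, which is the capacity floor the fleet
    had to tolerate.
    """
    min_in_service = sum(h.in_service for h in hosts)
    pending = [h for h in hosts if h.version != new_version]

    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]

        # Drain the batch: in a real system, VMs would be migrated
        # or traffic shifted elsewhere before this point.
        for host in batch:
            host.in_service = False

        min_in_service = min(
            min_in_service, sum(h.in_service for h in hosts)
        )

        # Upgrade the drained hosts, then put them back in rotation.
        for host in batch:
            host.version = new_version
            host.in_service = True

    return min_in_service


fleet = [
    Host("host-a1", "zone-a", "v1"),
    Host("host-a2", "zone-a", "v1"),
    Host("host-b1", "zone-b", "v1"),
    Host("host-b2", "zone-b", "v1"),
]
floor = rolling_upgrade(fleet, "v2", batch_size=1)
```

With four hosts and a batch size of one, the capacity floor in this toy model is three hosts: the fleet as a whole never goes dark, which is the property a 40-hour full shutdown gives up.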
In short, I really don’t see a good explanation for exactly why Verizon had to take its cloud down for so long. If I were a Verizon Cloud customer, I would want to know in specific detail what was really happening behind the scenes.
I fear that Verizon’s engineers simply weren’t up to the job of managing their cloud infrastructure properly, and so could only upgrade their systems by reaching back into their 2000s technology bag of tricks.
If that’s the case, Verizon won’t be the last. A sad and simple truth is that we have nowhere near enough cloud-savvy architects, designers, and administrators to go around. I fully expect more cloud problems in 2015 to come from administrative error than from any real technology problem.