[CentOS-announce] Notice of Service Outage and followup LON1/UK Facility

news · April 1, 2016

-----BEGIN PGP SIGNED MESSAGE-----

Hash: SHA1

== What happened ==

On Wednesday February 24th, at 6pm UTC, the DC hosting some of

the CentOS equipment used for various roles suffered

multiple power outages. The facility was completely dark

for just under 2 hrs, and we were able to start recovering services by

8pm UTC. By midnight we had most services restored, and by 2:00AM UTC Feb

25th we had all services restored.

During the outage, the machines in those racks were running on batteries

(UPS in the racks) but finally went down in an uncontrolled way due to

lack of communication with that UPS.
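To illustrate the failure mode described above, here is a minimal sketch of a fail-safe shutdown policy. It is not the tooling used in the racks: the `UpsReading` type, the `should_shut_down` function and the 30% threshold are all hypothetical names and values chosen for illustration. The idea is that losing contact with the UPS while it was last seen on battery is treated as a reason to shut down cleanly, rather than waiting for the battery to run out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UpsReading:
    """State reported by one successful poll of the UPS (hypothetical type)."""
    on_battery: bool
    charge_percent: float


def should_shut_down(
    reading: Optional[UpsReading],
    last_known: Optional[UpsReading],
    min_charge_percent: float = 30.0,  # assumed threshold, not from the announcement
) -> bool:
    """Decide whether the host should start a clean shutdown now.

    `reading` is None when the latest poll failed, i.e. communication
    with the UPS was lost.
    """
    if reading is None:
        # Communication lost: fail safe if the UPS was last seen on battery.
        return last_known is not None and last_known.on_battery
    # Normal case: shut down once the battery drops below the threshold.
    return reading.on_battery and reading.charge_percent < min_charge_percent
```

With this policy, `should_shut_down(None, UpsReading(True, 80.0))` returns `True`: the host no longer knows how much battery is left, so it stops while it still can.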

Subsequently, on Monday March 14th, we suffered another power outage in

the racks, this time due to an overload on the rack power circuits.

== Services that were impacted ==

- severity critical : the mirrorlist.centos.org node serving IPv6 went down

(while multiple mirrorlist.centos.org nodes serving IPv4 were still

online). That means that machines with only IPv6 connectivity couldn't

get yum to work to retrieve the list of nearest mirrors.

- severity medium : Our main buildservices queue management services

were down; note: this did not impact our ability to build, test and

deliver updates.

- severity medium : www.centos.org and www.centos.org/forums weren't

reachable through IPv6 : at the moment, those services are natively

reachable through IPv4, but proxied through nodes in that DC for IPv6

users. Most tested browsers were falling back to IPv4 during that period.

- severity medium : CentOS DevCloud

(https://wiki.centos.org/DevCloud) : that means that CentOS Developers

weren't able to instantiate new CentOS test VMs for their work, but

also weren't able to reach the existing ones.

- severity low : several publicly facing small services like

http://planet.centos.org , http://seven.centos.org (not critical and

could be restored quickly to other VMs elsewhere)

- severity low : the server leading the armv7hl builds for the Plague

build farm was also offline, meaning no armhfp builds during that

timeframe (but no updates were to be built, so the impact was mitigated)
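Several of the items above affected only one address family (IPv6) while the other (IPv4) kept working. A per-family reachability check makes that kind of partial outage visible. The following is a minimal sketch using only the Python standard library; the function name `reachable_over` and the 5 second timeout are assumptions for illustration, not part of the CentOS monitoring setup.

```python
import socket


def reachable_over(host: str, port: int, family: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds over `family`.

    `family` is socket.AF_INET (IPv4) or socket.AF_INET6 (IPv6).
    """
    try:
        # Restrict name resolution to the requested address family.
        infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
    except socket.gaierror:
        # No address published for this family (or resolution failed).
        return False
    for fam, socktype, proto, _canonname, sockaddr in infos:
        try:
            with socket.socket(fam, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return True
        except OSError:
            # Try the next address for this family, if any.
            continue
    return False
```

For example, calling `reachable_over("mirrorlist.centos.org", 80, socket.AF_INET6)` and then the same with `socket.AF_INET` (port 80 is an assumption here) would report each family separately, so an IPv6-only failure is not hidden by a working IPv4 path.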

== Followup actions and notes ==

Over the years, the baseline recovery model we've used and tried to

enforce is one of 'restore in place', take a downtime hit if needed -

and ensure we have service continuity for the user facing components

(the mirrorlist service, the centos update and content distribution

services). For other resources, like the main website etc, we ensure

there are good backups available in multiple places, usable to restore

services should there be a need. This model has worked well for us

over the years, and we've had very few, if any, service outages

that had a user impact. The restore in place/restore outside HA also

meant we were able to better utilise the exclusively sponsored

machines we rely on.

However, as the project grows, with a lot more infrastructure being

consolidated into a few locations for non-CDN services, our exposure

to service downtime has dramatically increased. It's clear that we need

to expand the scope of where we backup to, how we backup, how we

anticipate failure and our ability to restore services in a timely

manner should there be facilities outages. In the coming weeks, we are

going to undertake a deep dive into our Infrastructure design and

delivery and try to first come up with a consolidated set of risks we

need to manage against, and then work towards reducing the risk,

spreading the availability as needed.

Our backend storage platform for the DevCloud and persistent

storage for other nodes in the facility is run from a distributed,

replicated Gluster setup. In spite of the sudden loss of power, in a

production environment with hundreds of running VMs and dozens of

running data jobs, we were able to trivially recover our entire data

set with minimal data loss. Some of the running VMs inside the

DevCloud did see local filesystem issues, but we don't think that was a

backing storage issue. This event has dramatically increased our

confidence in the Gluster technology stack and we will certainly be

looking at extending deployments for it internally.

== Comments about hosting facility ==

Their status post about this:

http://status.uk2.net/2016/02/24/london-power-outage/

We have multiple racks at this facility, and have a long standing

relationship with them going back to late Summer 2012. Over this

period we have had a near perfect uptime record for our equipment

there. And above all we have been consistently impressed with the

speed and knowledge of the support we've received at the DC. In

many cases, how the facility reacts to an outage defines the real service

value - and in this case, we can only commend the fantastic support we

had through the outage hours. We do however feel there could be better

monitoring and reporting of some of the facilities information and

will be working with them to improve in those regards.

Fabian Arrotin and Karanbir Singh

The CentOS Project

-----BEGIN PGP SIGNATURE-----

Version: GnuPG v2.0.22 (GNU/Linux)

iQEcBAEBAgAGBQJW+6mPAAoJEI3Oi2Mx7xbtHo8IAI+RVIDjGwJOzgJ5Ry7mHwLe

Zc+aBUQklDk5oRaDk7QZHsaGp1lclNsutBk3YujNlXwMC4hUKdPwkTVuX50usQ7s

kd7qF1BlElNyfMPfFJGwchIQBDOZqZxkZP4uOrvQUnIZUYfyx6NnPnGS0uatBdnw

hBJ6TbgP6i50h7U0fNWjHU2I8xe0zsx1jVrvNngDMlQcIHC0d1KMtpOgSMR5f9Bn

bLwghfD4/yPyqJP1sc+021ANk1+a7uXs4KKG3MXpMlFyvYmv2ict0Q/sDtz0jzCx

kbRgDGm/GF1TUUENciESkHPKy3kLWA1oCicOkiEhzNz2YwFQNdNpi9PqWEK/F5Q=

=bDIN

-----END PGP SIGNATURE-----
