How Rackspace really went down

A tipster sent us the most comprehensive incident report to date on the downtime at Rackspace's Grapevine, Texas datacenter. Things happened pretty much as we reported. By the report's count, the big outage was the fourth incident in two days, and from what we can tell, it was unrelated to the others. The first three were caused by failures in Rackspace's internal power distribution network. The fourth, much larger outage was caused by a traffic accident. A summary of the report's findings, after the jump.

A pickup driven by a man experiencing a drop in his blood sugar crashed into a transformer that provides power to the datacenter. This disabled one source of power to the building. During a power failure, Rackspace's uninterruptible power supplies normally keep the servers running until the generators can be started. The chiller units that keep the server rooms cool automatically shut down and are restarted on generator power — a process that takes approximately 30 minutes. Once everything was up and running normally on generators, Rackspace decided to switch internal power to its secondary utility power source — going through the 30-minute chiller reset period one more time. During this period, the utility company shut off power completely to the building once again. Generators kicked in and servers were kept up and running.

As a result of the second power failure, the chillers once again had to be reset. Because of this repeated reset period with no cooling at all, temperatures inside the datacenter skyrocketed and servers had to be shut down to prevent heat damage. Once the chillers were brought back online and ambient temperature was reduced to acceptable levels, technicians began bringing affected machines back online.
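For the curious, the chiller arithmetic can be sketched in a few lines of Python. This is a rough model, not anything from Rackspace: it assumes each power-source switch restarts the roughly 30-minute chiller sequence from zero, and it uses the three Monday evening switch times from the report's timeline (6:30, 6:45 and 7:12 P.M.). The function and variable names are our own.

```python
from datetime import datetime, timedelta

# "Approximately 30 minutes" per the report; treated here as exact (assumption).
CHILLER_RESTART = timedelta(minutes=30)


def at(hhmm):
    """Build a timestamp on November 12th, 2007 from a 24-hour HH:MM string."""
    return datetime.strptime("2007-11-12 " + hhmm, "%Y-%m-%d %H:%M")


def cooling_gaps(power_switches, restart=CHILLER_RESTART):
    """Return (start, end) intervals with no cooling.

    Assumes every power switch restarts the chiller sequence from scratch,
    so a switch that lands inside an ongoing restart extends that gap.
    """
    gaps = []
    for t in sorted(power_switches):
        end = t + restart
        if gaps and t <= gaps[-1][1]:
            # Restart was interrupted: the gap now runs until the new restart ends.
            gaps[-1] = (gaps[-1][0], end)
        else:
            gaps.append((t, end))
    return gaps


def total_minutes(gaps):
    """Sum the length of all gaps, in whole minutes."""
    return int(sum((end - start).total_seconds() for start, end in gaps) // 60)


# Switch times taken from the report's timeline for Monday evening.
switches = [at("18:30"), at("18:45"), at("19:12")]
gaps = cooling_gaps(switches)
print(total_minutes(gaps))  # 72
```

Under these assumptions the three switches merge into a single stretch of about 72 minutes without cooling, ending around 7:42 P.M. The report puts the chillers back online at 7:55 P.M., so the real restart evidently ran a bit longer than this idealized model suggests.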

Want more? Here's the full report:

Incident Report — DFW1 Power Review
November 11th / November 12th 2007

Introduction
This Incident Report has been prepared by Rackspace to provide customers with the information relevant to the service disruptions in our DFW Data Center on November 11th and 12th 2007. This report includes a detailed description of the incident and a timeline of our actions. Rackspace Engineering teams are working on a corrective action plan in response to these events. We will prepare a follow-up report on the action plan.

This report documents a series of sequential events in the DFW Data Center. Please note that customers affected by one event may not have been affected by all of the incidents. Your Account Manager will have specific information regarding your configuration.

In line with our promise of Fanatical Support™ and the messages from Lanham Napier, our CEO, regarding this event, we are focused on supporting our customers and providing the information you need for your business and customers.

Incident Overview & Cause
*All listed times are in US Central Time.
Incident 1 — At approximately 4:30 A.M.* Sunday November 11th, Rackspace experienced a power outage in one section of our DFW data center. In the design of the DFW power infrastructure, power flows from the uninterruptible power supplies (UPSs) through a switch board to multiple power distribution units (PDUs). The power outage was caused by a breaker failure in the switch board and affected all downstream PDUs. This failure did not cause the entire data center to switch to generator power because its impact was limited to infrastructure components downstream of the UPS.

Rackspace engineers immediately began troubleshooting. Our first objective was to restore service. After our initial efforts to restore power were unsuccessful, we began to investigate alternate solutions. One option was to switch to generator power to bypass the failed breaker but, with the root cause still undetermined, Rackspace needed to be certain that making this transition would not worsen the situation. After confirming that the proposed solution would be an improvement, Rackspace moved the affected portion of the data center to generator power. According to our data, all services were restored by 6:50 A.M. on November 11th.

Incident 2 — At approximately 6:30 P.M. on the evening of the 11th, a breaker in the generator power grid tripped. The impact of this event was similar to the initial outage but on a much smaller scale. When this breaker failed, one PDU lost power. All customer devices with dual power supplies in this section of the data center remained online and were not affected. While the PDU in question was without power, devices with a single power supply were affected. Data Center technicians immediately acted to minimize the impact on these customers by moving these devices manually to alternate power supplies. Rackspace technicians worked to restore service as quickly as possible and by 7:40 P.M. the affected PDU was back online.

Our data center engineers worked through the night to identify both the root cause of the problem and a safe return path to utility power. At approximately 5:00 A.M. on November 12th Rackspace began the process of returning to utility power. By 5:10 A.M. the transfer back to utility power was completed.

Incident 3 — At 5:25 A.M. there was a problem with the utility distribution network. This caused another outage that affected the same section of the data center as the initial outage on the 11th. DC Engineers immediately initiated a switch back to generator power. By 5:40 A.M. the generators were once again providing power to the affected portion of the data center. There are several aspects of the distribution network that are currently being investigated and tested to determine the root cause.

Incident 4 — In an unrelated incident, at approximately 6:30 P.M. Monday, November 12th, a vehicle struck and brought down the transformer feeding power to the DFW data center. During a utility power failure like this, the chillers, a component of the HVAC system responsible for maintaining consistent temperature in the data center, automatically lose power as they are not directly powered from the UPS. Upon power failure, our emergency generators kicked in and provided power to the chillers as intended. At this point the DFW data center was fully operational.
The DFW facility has two separate utility feeds, and the engineers decided to start moving from generators to the secondary utility feed. Each time Rackspace alternates between utility and generator power, the chillers require us to follow a shutdown and restart procedure. This procedure normally takes approximately 30 minutes to complete. We started the transition to the secondary utility feed and initiated the restart process on the chillers.

Unfortunately, at this point, our utility provider shut down the secondary feed that was powering the data center, without notifying Rackspace. This was an emergency action taken by the utility in order to allow safe removal of the accident vehicle and protect the emergency responders. This unexpected power cut required DFW to switch back to generator power and reinitiate the chiller startup procedure. The repeated cycling of the chillers resulted in increasing temperatures within the data center.

At 7:35 P.M. the ambient temperature reached a critical level within the facility, so Rackspace began to manually shut down some servers in the affected areas. This action was taken to protect them from overheating, prevent data loss and hardware failures, and reduce temperatures. By powering down these devices and bringing the chillers back online on generator power, we were able to reduce the ambient temperature to an acceptable level. Lower temperatures allowed us to power up the devices that had been shut down earlier. At 11:37 P.M. we switched from generator power back to utility power for the areas of the data center that lost power due to the incident on Monday night.

The devices in the section of the data center affected by the first outage on November 11th remain on generator power. Our engineering team is investigating options to resolve the original issue. We will not transition off generator power until we can determine the root cause and test the proposed solution. We will notify affected customers of the maintenance window in advance before we begin this switchover.

Incident Timeline
4:19 A.M. November 11th — A problem in the internal utility power distribution grid caused a breaker failure resulting in loss of power to a portion of the DFW data center.

6:49 A.M. — The transition to generator power is completed and power is restored to the affected portion of DFW.

6:32 P.M. — A breaker in the generator power grid tripped causing a loss of power to a single PDU, including a smaller subset of the original group of customers.

7:40 P.M. — The situation was resolved and the PDU returned to service. Between 6:32 P.M. and 7:40 P.M. all devices with dual power supplies maintained power and were not affected. Customer devices with single power supplies in this area were affected. Data Center technicians immediately acted to minimize the impact on these customers by moving these devices manually to alternate power supplies.

4:00 A.M. November 12th — The utility distribution grid completed realignment and re-synchronization; all systems reported as ready for operation.

4:30 A.M. — Transfer of power was initiated and affected devices were slowly moved off of generator power and back to internal utility distribution power.

5:10 A.M. — Transfer was complete and all devices in DFW were running on utility power.

5:25 A.M. — The internal distribution grid failed again. Data Center engineering acted immediately to transfer all affected devices back to generator power.

5:40 A.M. — The transfer to generator power was completed in under 15 minutes and all affected devices had access to power.

6:30 P.M. — In a completely unrelated incident, an automobile accident caused damage to the primary utility feed to the DFW Data Center. The Data Center automatically failed over to generator power without any service interruption to customer devices.

6:45 P.M. — The Data Center Engineering team decided to transition from generators to the secondary utility power feed. The chillers were moved to the secondary utility power feed and started powering up.

7:12 P.M. — The local utility company shut the second feed down, without notifying Rackspace. This was an emergency action to protect emergency responders and ensure safe clearing of the accident scene. At this time the DFW facility returned to generator power.
Each time the data center alternated power sources, the chiller systems reinitiated their startup sequence, preventing the data center from receiving proper cooling.

7:35 P.M. — Due to the rising temperature inside the data center, Rackspace started shutting down infrastructure and servers in order to prevent them from overheating, prevent data loss and hardware failures, and to keep the entire facility from overheating.

7:55 P.M. — The chillers were brought back online and the facility was stabilized on generator power.

8:20 P.M. — The temperature and power systems had stabilized to a point where we were able to begin powering on servers that had been shut down earlier.

11:37 P.M. — The area of the facility affected by Monday evening's traffic accident was transferred back to utility power. The transfer was accomplished without any service interruption. The chillers continued operating normally.

Investigation
Throughout this situation our data center management team has been meeting with internal and external experts in an attempt to isolate the root cause of the original outage. Rackspace engineers and contractors are onsite evaluating all the systems and the events that triggered the failure. The original breaker is being thoroughly tested to rule it out as a possible cause. We also brought a load bank onsite to test the UPS cluster. These tests will allow us to determine when it is safe to return to utility power.

Next Steps
We are waiting on the results of the additional tests before planning to return to utility power. Rackspace will provide additional information as it becomes available and will provide advance notice before returning to utility power.