
I'm continuing to investigate the story of Tuesday's outage at 365 Main's San Francisco datacenter, which brought down some of the best-known sites on the Internet. Right now, a 365 Main executive is blaming failures in 5 of the company's 10 generators. That's right: fully half of 365 Main's generators failed just as San Francisco experienced a power outage. More to come on this soon, but for now, here's the memo from Marcy Maxwell, 365 Main's vice president of security.

From: "Marcy Maxwell"
To: "Engineering" ; "Security"
Sent: 7/25/07 5:08 PM
Subject: UPDATE: POWER EVENT - Fourth Notice

UPDATE: 5:00 P.M., Wednesday, July 25, 2007


A complete investigation of the power incident continues with several specialists and 365 Main employees working around the clock to address the incident.

Generator/Electrical Design Overview


The San Francisco facility has ten 2.1 MW back-up generators to be used in the event of a loss of utility. The electrical design is N+2, meaning 8 primary generators can successfully power the building (labeled 1-8), with 2 generators available on stand-by (labeled Back-up 1 and Back-up 2) in case there are any failures with the primary 8.

Each primary generator backs up a corresponding colocation room, with generator 1 backing up colocation room 1, generator 2 backing up colocation room 2, and so on.
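The N+2 arrangement the memo describes can be modeled with a few lines of code. This is a hypothetical sketch of the failover priority, not 365 Main's actual switchgear logic: each room's primary generator is tried first, then the two shared standby units.

```python
# Hypothetical model of the N+2 design described in the memo: eight primary
# generators (one per colocation room) plus two shared standby units.
PRIMARIES = {room: f"Generator {room}" for room in range(1, 9)}
BACKUPS = ["Back-up 1", "Back-up 2"]

def assign_generator(room, failed):
    """Return the first available generator for a room, trying the room's
    primary first, then each standby; None if all three have failed."""
    for gen in [PRIMARIES[room]] + BACKUPS:
        if gen not in failed:
            return gen
    return None

# On July 24, Generator 1 and Back-up 1 both failed their start sequences,
# so colocation room 1's load landed on the second standby unit:
print(assign_generator(1, {"Generator 1", "Back-up 1"}))  # Back-up 2
```

Under this scheme any single room's load survives two generator failures, but, as the events below show, a shared standby unit can still be overloaded when several primaries fail at once.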

Series of Electrical Events

The following is a description of the electrical events that took place in the San Francisco facility following the power surge on July 24, 2007:

* When the initial surge was detected at 1:47 p.m., the building's electrical system attempted to roll all colocation rooms to diesel generator power.

* Generator 1 detected a problem in its start sequence and shut itself down within 8-10 seconds. The cause of the start-up failure is still under investigation, though engineers have narrowed the list of suspected components to 2-3 items. We are testing each of these suspected components to determine if service or replacement is the best option. Generator 1 was started manually by on-site engineers and reestablished stable diesel power by 2:24 p.m.

* After initial failure, Generator 1 attempted to pass its 732 kW load to Back-up 1, which also detected a problem in its start sequence. The exact cause of the Back-up 1 start sequence failure is also under investigation.

* After Generator 1 and Back-up 1 failed to carry the 732 kW, the load was transferred to Back-up 2 which correctly accepted the load as designed.

* Generator 3 started up and ran for 30 seconds before it too detected a problem in the start sequence and passed an additional 780 kW to Back-up 2 as designed.

* Generator 4 started up and ran for 2 seconds before detecting a problem in the start sequence, passing its 900 kW load on to Back-up 2. This 900 kW brought the total load on Back-up 2 to over 2.4 MW, ultimately overloading the 2.1 MW Back-up 2 unit and causing it to fail. Generator 4 was manually started and brought back into operation at 2:22 p.m. Generator 4 was switched to utility operation at 7:05 a.m. on 7/25 to address an exhaust leak but is operational and available in the event of another outage.

* Generators 2, 5, 6, 7 and 8 all operated as designed and carried their respective loads appropriately.

* By 1:30 p.m. on Wednesday, July 25, after assurance from PG&E officials that utility power had been stable for at least 18 continuous hours, 365 Main placed the diesel engines back in standby and switched generators 2, 5, 6, 7 and 8 to utility power.

* Customers in colocation rooms 2, 4, 5, 6, 7 & 8 are once again powered by utility, and are backed up in an N+1 configuration with Back-up 2 generator available.

* Generators that had failed during the start-up sequence but were performing normally after manual start (1 & 3) continue to operate on diesel and will not be switched back to utility until the root causes of their respective failures are corrected.
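The arithmetic behind the Back-up 2 failure is worth spelling out. Per the memo, three loads cascaded onto the one working standby unit; summing them (an illustrative check, not 365 Main's control logic) shows why the 2.1 MW unit tripped:

```python
# Illustrative check of the cascade described in the memo: the kW loads
# that rolled onto Back-up 2 after Generators 1, 3, and 4 failed to start
# (Generator 1's load arrived via the also-failed Back-up 1).
BACKUP2_CAPACITY_KW = 2100  # 2.1 MW rating

cascaded_loads_kw = {
    "Generator 1": 732,
    "Generator 3": 780,
    "Generator 4": 900,
}

total_kw = sum(cascaded_loads_kw.values())
print(f"Total load on Back-up 2: {total_kw} kW")          # 2412 kW
print(f"Exceeds capacity: {total_kw > BACKUP2_CAPACITY_KW}")  # True
```

The 2,412 kW total matches the memo's "over 2.4 MW" figure: the first two failed loads alone (1,512 kW) fit within the unit's rating, and it was the third, 900 kW transfer that pushed it past capacity.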

Other Discoveries

* In addition to previously known affected colocation rooms 1, 3 and 4, we have discovered that several customers in colo room 7 were affected by a 490 millisecond outage caused when the dual power input PDUs in colo 7 experienced open circuits on both sources. A dedicated team of engineers is currently investigating the PDU issue.

Next Steps

* Determine exact cause of generator start-up failure and PDU issues through comprehensive testing methodology.

* Replacements for all suspected components have been ordered and are en route.

* Continue to run generators 1 & 3 on diesel power until automatic start-up failure root cause is corrected.

* Continue to update customers with details of the ongoing investigation.

Regards,

Marcy

Marcy Maxwell
Vice President, Security
365 Main Inc.
"The World's Finest Data Centers"