365 Main, the troubled datacenter operator, has finished its investigation into the July failure at its San Francisco facility that knocked some of the Internet's best-known websites, from Craigslist to LiveJournal to Technorati, offline. Ridiculously, the company first tried to blame PG&E for the failure, knowing full well that its clients pay it for reliable power even in a blackout. (Equally ridiculously, I ran a suspect tip that a drunk employee had wreaked havoc in the datacenter.) Now, the company has completely exonerated itself, pinning the blame on a component in its generators. Here's why you still shouldn't believe a word the company says. My analysis, and the company's press release, after the jump.
Of course, 365 Main's generators failed. The company blames a memory chip in a piece of electronics used to start the generators automatically. But aren't these generators tested monthly? If they are, the tests should have caught a fault that keeps them from starting. 365 Main notes that the component in question is used in only two of its datacenters. No word on whether the faulty testing procedures are common to all of its facilities, or just present in San Francisco.
And the kicker? 365 Main brags that it has "delivered 99.9942 percent uptime to customers," which sounds impressive until you do the math and realize that means the 365/7/24 facility is actually out of service, routinely, for about half an hour every year. Last month's outage, in other words, was all in a day's work for 365 Main. On top of that, consider this: It's a failure rate nearly six times as high as the "five nines" standard 365 Main promised when it launched. 365? More like 364.98.
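For anyone who wants to check that arithmetic, here is a rough sketch in Python. It assumes a 365-day year and treats "five nines" as exactly 99.999 percent; the function and variable names are mine, not anything from 365 Main.

```python
# Back-of-the-envelope check of the uptime figures quoted above.
# Assumptions (mine, not 365 Main's): a 365-day year, "five nines" = 99.999%.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(uptime_percent: float) -> float:
    """Average minutes per year a facility is down at the given uptime."""
    return (100.0 - uptime_percent) / 100.0 * MINUTES_PER_YEAR


def effective_days_per_year(uptime_percent: float) -> float:
    """How many of the 365 days the facility is actually up, on average."""
    return 365 * uptime_percent / 100.0


sf_downtime = downtime_minutes_per_year(99.9942)         # San Francisco figure
five_nines_downtime = downtime_minutes_per_year(99.999)  # the "five nines" bar
ratio = sf_downtime / five_nines_downtime
days_up = effective_days_per_year(99.9942)

print(f"San Francisco downtime: {sf_downtime:.1f} minutes per year")
print(f"Five-nines downtime:    {five_nines_downtime:.1f} minutes per year")
print(f"Ratio:                  {ratio:.1f}x")
print(f"Days actually up:       {days_up:.2f}")
```

Under those assumptions, the San Francisco figure works out to roughly 30.5 minutes of downtime a year against about 5.3 minutes for five nines, a ratio of about 5.8, and 364.98 days of actual service.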
Here's the press release. I recommend you trust it as much as you do the "365" in 365 Main's name.
365 MAIN REPORTS ON ROOT CAUSE OF GENERATOR FAILURE
Company Implements Fix for All Affected Generators and Makes Information
about the Fix Available to Data Center Industry
SAN FRANCISCO, Calif., Aug. 1, 2007 - Data center developer and operator 365
Main Inc. is issuing information today that details the root cause behind
why back-up power generators in the company's San Francisco facility failed
to start during a PG&E power outage last week, resulting in approximately 40
percent of customers in the facility losing power to their equipment for up
to 45 minutes.
At 1:47 p.m. on Tuesday, July 24, 365 Main's San Francisco data center was
impacted by a power surge caused when transformer breakers at a local PG&E
power station unexpectedly opened. PG&E has still not determined what caused
the breakers to open.
Typically when a power outage occurs, the outage triggers 365 Main's
rigorously maintained and tested back-up diesel generators to start-up and
take over providing power supply to customers. 365 Main's San Francisco
facility has ten 2.1 megawatt back-up generators to be used in the event of
a loss of utility power. Eight primary generators can successfully power the
building, with two generators available on stand-by in case there are any
failures with the primary eight.
However, following the power outage last week, three of 365 Main's 10
back-up power generators, manufactured by Hitec, failed to complete their
start sequence. A complete investigation of the incident began immediately.
Within hours of the incident, an international team of specialists was
deployed to 365 Main's San Francisco data center facility to join on-site
technicians and begin systematically testing the generators in search of a
root cause. After days of thorough testing around the clock, the team
discovered a weakness in an essential component of the back-up generator
system known as a DDEC (Detroit Diesel Electronic Controller).
The team discovered a setting in the DDEC that was not allowing the
component to correctly reset its memory. Erroneous data left in the DDEC's
memory subsequently caused misfiring or engine start failures when the
generators were called on to start during the power outage on July 24.
The investigation team discovered DDEC issues on each of the failed Hitec
units and were able to successfully simulate failure. A fix was introduced
by altering the timing of a command to the DDEC component, allowing more
time between the engine shut-down command and the DDEC reset command. Once
this fix was introduced, the Hitec generators successfully passed more than
50 consecutive start-up sequence tests without incident.
The testing methodology was performed by Hitec specialists along with 365
Main's chief technician and staff. Specialists from Cupertino Electric were
present during all testing, and EYP Mission Critical Facilities will provide
independent verification of the findings the week of 8/6/07.
365 Main has implemented the DDEC fix in its San Francisco and El Segundo
facilities. Of the five data centers in 365 Main's portfolio, the San
Francisco and El Segundo facilities are the only ones with Hitec generators
containing DDECs. All other facilities feature other brands of generators
or have different models of Hitecs.
365 Main is sharing the discoveries of its investigation with other Hitec
customers. In addition, Hitec has expanded its preventative maintenance
procedures as a direct result of discoveries made during the 365 Main investigation.
In the wake of the outage, 365 Main published an apology to customers and
daily updates directly from the investigation team meeting minutes, allowing
customers and the public at large to track progress. A complete archive of
these updates and more details about today's update are available at:
Chris Dolan, president and CEO of 365 Main, said, "365 Main has a track
record of providing customers with data centers that are considered to be
among the world's finest. We extend our sincere apologies to customers who
were impacted by this incident. Addressing customer concerns is our top
priority. In the days since the incident occurred, we have identified and
corrected the root source of the problem and are taking steps to prevent
this type of problem from happening again. We are also making our
comprehensive findings available to other data centers to try to prevent the
same problem from recurring elsewhere."
Glenn Ellis, president and CEO of Hitec USA, also commented: "Our top
priority is taking steps to prevent this type of unforeseen incident from
occurring again. We sincerely apologize to 365 Main and its customers that
our generators failed to deliver the continuous power as designed."
365 Main's Track Record
Since its inception over five years ago, 365 Main has delivered 99.9967
percent power uptime to customers across its five-data-center portfolio.
This includes the outage experienced in San Francisco last week. 365 Main's
San Francisco facility has delivered 99.9942 percent uptime to customers
during the last five years, inclusive of last week's outage.
As part of their service level agreements with 365 Main, 365 Main customers
receive rent abatements (refunds) in the event that electrical power is
dropped in the section(s) of the data center where their servers are
located. 365 Main is honoring all service level agreements with affected