💽 This is the 800 Hz Vibration That Broke a Data Center

Why your data slows down—and how physics knew it before big data did.

We often hear about the high-performance chips and networking that make up the cloud information technology utility. However, hard disk drives escape our attention. Even though our life’s data is stored on these elegant sub-systems, they remain unsung and unheard of heroes in Internet data centers.

HDDs are considered inexpensive, so they are easily replaceable if they fail. The failure can be hard (completely broken) or soft (performance degradation). And, having redundant drives (local and geographically redundant in the cloud) makes up for any concerns about loss of data.  

The Internet data centers that make up the cloud are filled with thousands of hard disk drives (HDDs), with total storage capacity measured in exabytes (billions of gigabytes). The cost-effectiveness of HDDs relative to solid state drives (SSDs) makes them ideal for bulk storage. Consequently, the majority of data storage in Internet data centers is handled by HDDs.

While HDDs may be cheap, an immense amount of energy is required to make, ship, and operate them. Minimizing faults for sustainability must be the mantra.

Predicting Resonance and HDD Failures

Source: Chandrakant Patel

I was invited to speak at the USENIX FAST Conference in 2008. In my keynote, I projected a hypothetical failure scenario to expect in HDDs. I suggested that as new high heat dissipating servers are introduced, the HDDs will be subject to faults or hard failure. Pulsating vibration stimulus from high speed fans needed for cooling would cause the HDD faults.

To back up my hypothesis, I sketched out a future 1U (44 mm tall, rack-mounted) server and a hard drive's internal head arm assembly, and used fundamentals to calculate two quantities (see figure):

  1. The natural frequency of a head arm in a disk drive in a server. Given the geometry and boundary conditions, I estimated the natural frequency of a head arm in a hard disk drive to be 819 Hz. The figure shows the problem setup and the answer page, but not all the calculations. Shoot me an email (see bio) if you’d like to receive the full thing.

  2. The vibration stimulus from the cooling fans. I showed that future servers housing these hard disk drives will have fans rotating at 12,000 revolutions per minute, or 200 revolutions per second (200 Hz). The fans will be four bladed, so the blade passing frequency will generate a vibration stimulus of 4 × 200 = 800 Hz.

I showed that these future fans in the server will produce a pulsating stimulus (800 Hz) that matches the head arm natural frequency (~819 Hz). The HDDs will fail.
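The arithmetic above can be sketched in a few lines of Python. This is only an illustration, not the original calculation: the fan speed, blade count, and 819 Hz natural frequency come from the text, while the single-degree-of-freedom amplification formula and the damping ratio of 0.02 are my assumptions, chosen to show how sharply response grows when the forcing frequency sits within a few percent of the natural frequency.

```python
import math


def blade_passing_frequency_hz(rpm: float, blades: int) -> float:
    """Blade passing frequency: shaft revolutions per second times blade count."""
    return rpm / 60.0 * blades


def amplification_factor(forcing_hz: float, natural_hz: float,
                         damping_ratio: float) -> float:
    """Steady-state dynamic amplification of a damped single-degree-of-freedom
    oscillator under harmonic forcing (an assumed, simplified model of the arm)."""
    r = forcing_hz / natural_hz
    return 1.0 / math.sqrt((1.0 - r**2) ** 2 + (2.0 * damping_ratio * r) ** 2)


# Values taken from the article.
FAN_RPM = 12_000
FAN_BLADES = 4
ARM_NATURAL_HZ = 819.0

# Hypothetical value, NOT from the article: a lightly damped structure.
ASSUMED_DAMPING_RATIO = 0.02

bpf = blade_passing_frequency_hz(FAN_RPM, FAN_BLADES)
ratio = bpf / ARM_NATURAL_HZ
gain = amplification_factor(bpf, ARM_NATURAL_HZ, ASSUMED_DAMPING_RATIO)

print(f"Blade passing frequency: {bpf:.0f} Hz")
print(f"Forcing / natural frequency ratio: {ratio:.3f}")
print(f"Amplification at assumed damping: {gain:.1f}x")
```

The stimulus lands at 800 Hz, within about 2 to 3 percent of the arm's natural frequency. Under the assumed damping, the response is amplified roughly 16 to 17 times relative to a static load of the same magnitude. A different damping ratio changes the number, but not the conclusion that the two frequencies are too close for comfort.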

Likelihood of HDD Errors

It is also noteworthy that the pressure drop across these servers will be high (approximately 250 Pascals, or 1 inch of water) due to the dense packaging of chips and support devices. The fans have to be designed to overcome this resistance to flow, and thus must be very powerful.
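For readers who want to check the units, here is a small Python sketch. The 250 Pa figure is from the text. The volumetric flow rate is a hypothetical value I picked purely for illustration, and the power computed is only the ideal air power (pressure drop times flow rate), not the electrical power a real fan would draw.

```python
# Conventional conversion: 1 inch of water column is about 249.09 Pa.
PA_PER_INCH_WATER = 249.09


def pa_to_inches_water(pressure_pa: float) -> float:
    """Convert a pressure in Pascals to inches of water column."""
    return pressure_pa / PA_PER_INCH_WATER


def ideal_air_power_watts(pressure_drop_pa: float, flow_m3_per_s: float) -> float:
    """Ideal power needed to push air across a pressure drop (no fan losses)."""
    return pressure_drop_pa * flow_m3_per_s


PRESSURE_DROP_PA = 250.0        # from the article
ASSUMED_FLOW_M3_PER_S = 0.02    # hypothetical, NOT from the article

inches = pa_to_inches_water(PRESSURE_DROP_PA)
power = ideal_air_power_watts(PRESSURE_DROP_PA, ASSUMED_FLOW_M3_PER_S)

print(f"{PRESSURE_DROP_PA:.0f} Pa is about {inches:.2f} in of water")
print(f"Ideal air power at the assumed flow: {power:.1f} W")
```

Because air power scales directly with pressure drop, denser packaging pushes designers toward faster, more powerful fans, which is what raises the blade passing frequency in the first place.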

Based on my simple “back of the envelope” calculations, I predicted that we will see HDD errors—either degradation in performance or crash—as a result of excessive vibration of the head arm. 

Source: Chandrakant Patel

Indeed, the head arm is akin to the wine glass in the 1970s Memorex shattering wine glass commercial featured in our newsletter To AI or Not to AI: Before You Prompt, Try Physics.

Furthermore, if the errors are soft (degradation in performance), they will not be repeatable when the hard drives are tested on the bench, because the stimulus present in the server would not exist there.

The Ground Truth

A few months after my talk, I received emails from data center operators about such failures. The HDDs had slowed down (drop in throughput of data due to oscillating heads that kept trying to read or write). 

Some said that they took the HDD out, sent it back to the manufacturer, and no errors were found by the manufacturer in their benchtop testing.

👉 Question: So, why no error on benchtop standalone testing?

👉 Answer: On the benchtop, the stimulus that caused resonance induced failure did not exist. 

I learned that, as is often the case after large-scale failures, data was collected from the hard drives and other sensors to figure out the root cause, without much success. Data scientists could find clusters of failures by mining historical data.

They tried to correlate the failures with temperature and other variables, without reaching a specific conclusion.

Clearly, data and data mining alone, without domain knowledge, do not work in complex physical systems. Of course, sometimes one can guess and find a correlation, but those iterations are more energy intensive than using paper and pencil.

Parting Thoughts

This article is a case study of vibration-induced hard disk drive failures (outright failure or degradation in performance) in data centers. It underscores the importance of domain knowledge.

When hard drives physically failed (not a cyber attack) in a data center, data scientists used the large amount of data collected from the drives to look for the root cause. They used statistical AI techniques to correlate the failures with operating temperature and other variables. However, the root cause could not be determined.

A simple calculation was all that was needed to get to the root cause: one that showed the alignment of the server cooling fans' blade passing frequency with the natural frequency of the arm in a hard disk drive. The blade passing frequency (800 Hz) nearly coincided with the head arm natural frequency (~819 Hz), resulting in excessive resonance-induced vibration.

Our natural reaction today, when faced with these types of problems, is:

  1. Let's ask a generative AI chatbot for the answer. A long list of possibilities comes up, but one is often unable to pinpoint the root cause.

  2. The next step is to collect data and perform statistical AI to get to the root cause. However, data collection and analysis cost time, money, and energy.

Both are good at showing the way, but neither can get to causation. In the world of Physical AI, it is best to start with domain knowledge. Even if one has to go to genAI, and eventually to Statistical AI, domain knowledge is still king.

Don't leave home without it.

About the Author

Former SVP, Chief Engineer, and Senior Fellow at HP, Chandrakant is a leader in AI, energy-efficient computing, and sustainability. He is an IEEE Fellow, ASME Fellow, member of the NAE, and the Silicon Valley Engineering Hall of Fame. Follow me on LinkedIn or email me at [email protected].

Want to stay ahead of the curve with insights into the newest advancements in Physical AI? Subscribe to Chandrakant’s newsletter at GenAI Works.
