The time I literally dropped a server

This is a short story about one of the stupidest things I ever did in my professional life, namely, dropping a physical server from the top of the rack in a data center.

Spring feels like my favorite season, not because it beats summer, but because it carries the promise of summer. In spring I'm so happy that the depressing gray days of winter are coming to an end.

In early 2014 I was driving from the Mendix office in Rotterdam to our co-location data center (DC for short) in Amsterdam. I felt fantastic, for a couple of reasons: I was driving my new car, I was paid to do an important job with a $10k 1U server in the boot, and as the cherry on top, it was finally spring and the weather was great! There have not been many times in my life when I felt this wonderful. Little did I know that I was now at the top of a rollercoaster, and on the way back I would feel very different.

All of the physical servers at Mendix had Pokemon names. The one in my car was called Flareon. It had broken down a couple of weeks before and had been fixed by Hans, our lead systems engineer. We now had to put it back in the rack with its fellow Pokemon. I was the team's product manager, so on paper, driving to the data center and replacing a server was not really my job. But I had gotten my DC access badge a couple of months before and I was happy to do it. This was my first solo mission.

Flareon is the second server from the top. Note that the left metal extrusion with the Intel logo is misaligned with the rest of the servers. After reading the rest you'll know why.


When I got through the security gate, the floor looked deserted. I drove the cart to Flareon's aisle and stopped at the rack. As soon as I opened it, I noticed something. The empty slot where the server was supposed to go was quite high up. I could only just reach it if I stood on my toes.

The rack looked roughly like this (sorry for the rough sketch, it's from memory).

| Switch 1         |
|   Server 1       |
|   Server 2       |
|   Server 3       |
|   Server 4       |
|   Server 5       |
|   Server 6       |
| Switch 2         |
|   Server 7       |
|   Server 8       |
|   Server 9       |
|   Server 10      |
|   Server 11      |
|   Server 12      |
| Netapp SAN (4U?) |

Flareon was server 2 in this rack, and was about 2m / 7ft up. No matter, I grabbed a little step ladder and slid out the rack rails. I went down to get the server and noticed it was a bit difficult to climb the ladder with an unwieldy and heavy server that weighed about 15kg / 33lbs.

When the server and I reached the top, I also noticed that the rail mount had a system with four latches that all had to click into place together. I knew this already, but somehow I had never been as aware of it as I was now. A further complication was that once a latch has clicked in, you need to press it on the left or right side to release it. But you're probably holding the server with one hand in the front and one hand in the back (so yes, not left or right). For the lower placed servers you could hold the server with one hand and a knee, and use your spare hand to release the latch, but at 2m in the air, balancing on a step ladder, I had no knees or spare hands available. I suddenly realized I had come to a two-person job alone and felt very lonely.

I now had a choice. I could go down and give up, or see if there was someone available to help. Alternatively I could just go for it and try to do it by myself. What's the worst that could happen? My instincts have been shaped by software development, and there the worst that can happen is not very bad. You can always do git reset.

So after a second or two of deliberation while balancing on the step ladder with a server that was not getting any lighter, I decided to go for it. Standing on the left side, I would lift the server over the rails and try to drop the two latches on one side into place. Then I would move my hands slowly to the other side and carefully push the other two in. I never got that far. One latch clicked, and the other one missed its opening. To fix it, I’d need one hand to hold the server and another to release the latch. Impossible. Within half a second the 15kg server started falling 2 meters towards the ground. The rails twisted and released the latch. I tried to grab the server on its way down, but lost my balance and fell after it like Gandalf fell after the Balrog.

I don't think the data center had ever been exposed to this amount of sound during normal operations. It was deafening. Within 10 seconds I realized the floor was in fact not deserted, as two guys who had been working three aisles down came running towards me. "What the f*ck happened?!" they shouted. Actually the scene was quite straightforward and after a moment they had figured it out. I was so shocked I didn't even answer. They asked if I was alright and then offered help. I noticed I was not hurt myself, and the server had a dented front but otherwise looked alright.

With the help of the other guys we put Flareon in its rack. I wasn't sure if it was broken or not, but it couldn't stay on the ground so why not put it up there now that I had two volunteers?

When the server was in place, my new friends went back to their own servers and I tried to calm my nerves. I looked at my phone and saw a lot of Nagios alerts coming in. What the hell?

After deciphering some of them and realizing another server had suddenly gone offline, I looked up and sure enough, the ON LED of the server underneath Flareon looked like it was very much OFF. In my fall I must have hit the power button... I called Hans for help and explained what had gone wrong, and he calmly reassured me that we would fix this first and figure out the rest later. What a great teammate!

Hans set out to restart all the applications on this server. About 50 Mendix apps were down for half an hour or so (I tried to find this incident on the status page but we only started using it a couple weeks after this incident. Coincidence?). In the meantime I hooked up Flareon's power and network and started it up. Hans did a system test and everything looked alright.

I locked up the rack, packed the cart in my car and started driving back. On the way home I felt incredibly stupid. I had dropped a $10k machine, caused a serious production outage and suddenly realized this could have ended much, much worse.

See, on its way down the server could easily have hit the Netapp with 40+ disks spinning at 10k RPM. Those things don't like vibrations. There are stories of disk arrays being destroyed by way less than what happened that day. Rebooting a server is annoying, but destroying ALL primary data for hundreds of customers would have been a disaster of unimaginable consequences for the company. I got away easy, but the realization made me nauseous.

That spring day in 2014 is burned into my memory - a real rollercoaster. Here are three things that I've learned:

- Hardware is different from software; there is no git reset

- Never do a two-person job solo

- Disable the power button on servers

 

The story became infamous within the company. A couple months later we went back and did a re-enactment.

Yours truly. Horse mask and server opened up for dramatic effect.



P.S. I looked up some instruction videos on the latching mechanism of the DL380p. I was probably doing it wrong, as you're supposed to take the rails apart, attach one half to the server, and just slide it in! So lesson number four:


- Read the instructions.




