[TSTIL] The Reaper
[ This is a part of "The Software That I Love", a series of posts about Software that I created or had a small part in ]
2017 - The Reaper
We launched Mendix Cloud v4 by starting the new Free Apps tier on it. These apps came without an SLA, which allowed us to take more risks and learn to operate Cloud Foundry. The Mendix Free Apps cluster had enough RAM for about 100 concurrent apps. It could scale up and down, but we wanted to keep the costs reasonable.
An app would run as long as there was HTTP traffic to it. If an app did not have any traffic for 30 minutes, we killed it. When an HTTP request came in after that, we served a "loading" page while the app was spinning back up.
We had a small app (The Resumer) that would catch all traffic for apps that were not running and serve the loading page. I believe Xiwen wrote most of it, but my memory is failing me. It would fire a request to the CF API to find and start the app. The Reaper was its counterpart. It would talk to the CF API and list all running apps. Then it would call the admin API of each app to see when its last request was handled, and kill the app if that was more than 1800 seconds ago. For some reason, The Reaper stopped running once we had more than 15,000 (?) apps in the Free Tier. The cluster started misbehaving badly when it ran out of capacity. We could not figure out why or how, but we needed a fix.
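For the curious, the core loop of The Reaper can be sketched in a few lines of Python. This is not the original code, just a minimal sketch of the logic described above. The names `RunningApp`, `reap_idle_apps`, and the three callables passed in are all made up for illustration; in the real tool they would have been calls to the CF API and to each app's admin API, which I am not reproducing here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

IDLE_LIMIT_SECONDS = 1800  # 30 minutes without traffic


@dataclass
class RunningApp:
    guid: str  # hypothetical identifier, stands in for whatever the CF API returns
    name: str


def reap_idle_apps(
    list_running_apps: Callable[[], Iterable[RunningApp]],
    seconds_since_last_request: Callable[[RunningApp], float],
    stop_app: Callable[[RunningApp], None],
    idle_limit: float = IDLE_LIMIT_SECONDS,
) -> List[str]:
    """Stop every running app that has been idle longer than idle_limit.

    The three callables are placeholders for the real API calls:
    listing running apps, asking an app when it last handled a request,
    and stopping an app. Returns the names of the apps that were stopped.
    """
    reaped = []
    for app in list_running_apps():
        if seconds_since_last_request(app) > idle_limit:
            stop_app(app)
            reaped.append(app.name)
    return reaped


if __name__ == "__main__":
    # Fake data instead of a real cluster: one busy app, one idle app.
    apps = [RunningApp("1", "busy-app"), RunningApp("2", "idle-app")]
    idle_seconds = {"busy-app": 12.0, "idle-app": 2400.0}
    stopped: List[RunningApp] = []

    names = reap_idle_apps(
        list_running_apps=lambda: apps,
        seconds_since_last_request=lambda app: idle_seconds[app.name],
        stop_app=stopped.append,
    )
    print(names)  # prints ['idle-app']
```

Passing the API calls in as functions keeps the decision logic testable without a cluster, which would have been handy when something only failed at 15k apps.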
When I ran the same code on my workstation, everything worked fine and the cluster went back to normal. So, you can guess what I did. Against all proper engineering practices, I ran "The Reaper" from the workstation under my desk with a cron job for a couple of months. I was a Product Manager and should not have been touching code or infrastructure at all, but I couldn't help myself. Later on, Daniel vD (now also a CTO) found that there was a bug in the CF API logic. Without admin credentials it would make a query that scaled O(n²) with the number of apps. With admin credentials the query did not need the extra security protection, which is why that was fast. As we were the only idiots in the world running Cloud Foundry with 15k stopped and 100 running apps, no one else had run into this.
Thanks to my friend Daniel you can read the issue here: https://github.com/cloudfoundry/cloud_controller_ng/issues/1272 . When he documented the issue we were apparently at 85k apps. We were growing fast! The Reaper was a pretty cool name for a pretty cool tool.
previous: 2016 - The too clever scheduling service