[TSTIL] The too clever scheduling service

[ This is a part of "The Software That I Love", a series of posts about Software that I created or had a small part in ]

2016 - The too clever scheduling service

In Mendix Cloud v4 we only wanted to use fault-tolerant services. So no single VMs, but "services" that were "web-scale", "clustered" and "horizontally scalable". Most services we created were request-based: a request would come in, and the service would respond. Simple. We built everything according to the 12-factor app architecture and ran apps on Cloud Foundry. On the AWS side we used Lambda and SQS. This was our toolbox for the new architecture.

One service that was not request-based was the backup service. Every night we had to create backups. Users didn't trigger this backup creation; we had to trigger it ourselves. If you use a traditional VM, you set up a cron job and you're done. In our new world this was somehow considered bad, but we had no tools in our toolbox to schedule events. AWS Lambda can run on a schedule now, but back then it could not.

For the backup service we set up an app in Cloud Foundry that would listen to an SQS queue. If we put a "create backup for app X" job on the queue, the backup service would pick it up and create a backup. This makes sense and is a pretty good architecture. However, we needed something to put these jobs on the queue every night for all apps. Xiwen and I came up with something that was supremely clever. We would put another kind of job on the queue. That job was "run all the nightly backups for <DATE>, and do this at <TIME>."

If the app saw this job and it was already past <TIME>, it would trigger the nightly backups AND submit a job for the next day. If it was not yet <TIME>, it would hold the job for about a minute, and then put it back on the queue. The app kept kicking the "can" down the road. Ad infinitum.
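To make the mechanism concrete, here is a minimal sketch of the "kick the can" loop. It is not our original code: it uses an in-memory deque instead of SQS, the names BackupJob, NightlyJob and handle are made up for illustration, and <DATE> and <TIME> are folded into a single run_at timestamp. The one-minute hold is only mentioned in a comment so the example runs instantly.

```python
from collections import deque
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class BackupJob:
    """Hypothetical message: 'create backup for app X'."""
    app: str


@dataclass(frozen=True)
class NightlyJob:
    """Hypothetical message: 'run all nightly backups at run_at'."""
    run_at: datetime


def handle(job, queue, now, apps):
    """Process one message taken from the queue and return what happened."""
    if isinstance(job, BackupJob):
        # The real service would create the backup here.
        return f"backup created for {job.app}"

    if now >= job.run_at:
        # Time has passed: fan out one backup job per app...
        for app in apps:
            queue.append(BackupJob(app))
        # ...and schedule the same thing for the next day.
        queue.append(NightlyJob(job.run_at + timedelta(days=1)))
        return "nightly backups triggered"

    # Not yet time. The real service held the job for about a minute
    # before putting it back; this sketch requeues immediately.
    queue.append(job)
    return "requeued"


if __name__ == "__main__":
    apps = ["app-a", "app-b"]
    queue = deque([NightlyJob(datetime(2016, 1, 1, 2, 0))])  # the "magic packet"

    # Simulate the worker polling at 01:59 and again at 02:00.
    for now in (datetime(2016, 1, 1, 1, 59), datetime(2016, 1, 1, 2, 0)):
        job = queue.popleft()
        print(now, handle(job, queue, now, apps))
    print(list(queue))
```

Note that the whole schedule lives in that single NightlyJob message. Nothing outside the queue recreates it, so if it ever goes missing, nothing fires again until somebody inserts a new one by hand.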

We thought this was clever. But if anything ever happened to "the can", there would be no more backups: not that day, not the next, not ever. Full of hubris, we thought this would never happen. Our team started hiring new people during this time, and Xiwen and I proudly told them about the design. The responses were kind of mixed. When we told Emir, he was very impressed by our engineering prowess. When Hans T. came in, he was like "what on earth were you guys smoking".

We went ahead anyway and launched. I remember we had to "insert the magic packet" into the queue by hand to kick-start the process. What happened next was kind of funny. On some nights we had no backups. On some nights we had two, sometimes three. Sometimes half the apps had a backup. We had no idea what was happening. Hans T. had been right: the system was complicated, hard to debug and super fragile.

Because we launched on our free tier we had some time to "screw around", and we got it under control eventually. I don't remember how, but I remember we put duct tape on duct tape to make it reasonably reliable. In hindsight we should have been less principled. We had all these super reliable VMs from Cloud v3, and adding a simple cron job there to start the backups in Cloud v4 would have been trivial. We didn't do it because we shouldn't "cross the streams". The KISS principle is really important when engineering production systems. "Keep it simple, stupid." Better to have a small ugly wart than a big one with a lot of makeup plastered over it. It was a good lesson for all of us.

