[TSTIL] Instadeploy

[ This is a part of "The Software That I Love", a series of posts about Software that I created or had a small part in ]

2015 - InstaDeploy

When we launched "Free Apps" at Mendix I was pretty proud. It was basically "Sandboxes V2 + WebModeler". Under the hood we had this whole new architecture with all kinds of clever "cloud-scale" solutions. It turned out that users were not so happy. Every time you changed something in your app you had to deploy it. We wanted people to use Free Apps so we hid the old "local deployment" option. The Free Apps deployment service was buggy and slow. Local deployment used to take 10 seconds. The old sandboxes took 1 minute, the Free Apps took 4. On bad days it took 8.

The cloud team had launched the deployment service without any SLAs or basic explanation of the limitations to the other teams. A bit immature. The WebModeler and Modeler teams just had to use it and see how it worked. In testing it was already very flaky, but under production usage it was really bad. We thought we could get away with it because in the cloud, the rubber hits the road and we had failing stuff all the time. The other teams had stable unit tests etc. Not us. We thought that was cool. Now the ones who got most of the flak from management was the WebModeler team. They had all the visibility and the deploy button was in their product. They felt helpless.

Me and the cloud guys were not very open to helping them out. I was under a lot of stress to launch cloud v4. In the meantime we were scaling cloud v3 and needed new hardware every couple of months. This took most of my time and mental energy. I also could not justify the focus on our Free Tier while our paid production workloads were in danger. Sorry for having to deal with that, Erik vd P., Michel, Kishen, Daniel D. and Arjan B. . After it got really bad we took part of the blame and we started figuring out how to optimize the deployment time.

Then we had the annual Company Kick-Off in Noordwijk (the Mendix CKO). I ran into Derek, and said that I was confident we could get it to two minutes. His response was very Steve Jobs-like: "Two minutes? Why should it take any time at all? It should feel instant! If I make change I want to see it right away!". F*ck. He was right. We needed to rethink our entire approach.

I've never had a clearer call to action. I don't know how, but I managed to convince Johan that we had to put the top hackers from every team in one room for two days. This was Benny for the Modeler/MxBuild, Michel for the WebModeler, Xiwen and Daniel for Cloud, and maybe some more people. Let me know if you were there. I was there too, "managing". Because of Derek's speech I was fired up. Everyone else was happy we could finally fix the deployment problems. We all felt like we were on a mission from God.

We added some timing logs, broke down the deployment process and started optimizing our own separate parts of it. We started at 4 minutes and had ideas on how to get to 2. I realized we had to take ugly shortcuts to get beyond that. Instead of sending the deployment package to the deployment service, it had to go straight to the running container. This was very unholy and the 12-factor-app gods did not approve. I thought of Bill O'Reilly. "Fuck it, we'll do it live". We modified the buildpack to include a compiler on hot standby. When it got a deployment package on a separate secret HTTP path, it would trigger the compiler and reload the app. Benny built the mxbuild server. Xiwen got it to run in the container. Michel optimized the webmodeler so it could create a deployment package in 10 seconds instead of 20.

The time kept dropping. We were at 40 seconds. Benny optimized some more with pre-warming the build cache. Michel said "a-ha" and got package creation down to 0.1 seconds. At night Xiwen or me finally got mxbuild to run. On the second day the end-to-end unholy deployment cycle was working. Not in 2 minutes, not in 1 minute, but in 2 seconds. Holy shit.

It made for a great demo, a great story and when it worked, a great product. It never got super reliable, but boy was it fast.






Comments

Popular posts from this blog

"Security Is Our Top Priority" is BS

AI programming tools should be added to the Joel Test

The long long tail of AI applications