[TSTIL] Mendix Cloud v4 - Part 2 - How to launch a complicated product

[ This is a part of "The Software That I Love", a series of posts about Software that I created or had a small part in ]

2018 - Mendix Cloud v4 - Part 2 - How to launch a complicated product

Note: This was a multi-year mega project (at least by my standards) and ran from about 2015 until 2018, when I left Mendix. It's still one of my proudest professional achievements. If you haven't yet, read part 1 about Cloud v4.

During 2017/2018 our team was doing a lot of things at the same time:
  • Handle the operational fires in Mendix Cloud v1/v2/v3.

  • Add new hardware with 50%-100% YoY growth, as well as handle all kinds of customer requests, like resizing VMs or adding more storage. These were all manual operations by our engineers.

  • Migrate v1/v2 apps to v3 to ensure standardization.

  • Build v4, which would make everything standard, scalable and self-service. It would finally resolve the chaos of v1/v2/v3.
Fires? What fires?

Running a 24/7 hosting operation for customers who are not very technical is tricky. Our brand promise was to make hosting Mendix apps simple for customers, so that meant that WE had to do a lot of hard things. We had plenty of crisis situations:
  • Unexpected security vulnerabilities like Heartbleed that had to be patched. For Heartbleed we patched everything within about 4 hours of the announcement, thanks in large part to Hans.

  • Physical servers breaking down, leading to about a hundred apps going offline at a time.

  • Operating system migrations (e.g. from Debian Squeeze to Wheezy) that had to be performed manually across thousands of VMs.

  • Programming errors leading to data corruption, both at the hypervisor level and in the Mendix app that managed the entire cloud.
This may sound strange, but I loved the adrenaline rush when a crisis started. I now have a sort of spider sense for it. As Product Manager I often took the lead, jumped into organizing mode and got customer service, the 2nd & 3rd line engineers and management lined up. I tried to hold regular status overviews via organized Excel sheets and wrote customer communication updates. After a couple of crisis situations we set up https://status.mendix.com to streamline our communication, which helped a lot. I built pretty good relationships with our people in security, customer support and CloudOps. At easee, my later company, the server setup was super boring and I secretly missed the adrenaline rushes quite a bit.

Needed features and the people that built them

But back to v4. We had expanded our team with Emir, Hans T., Cenk, Raphael, and Diogo. Unfortunately Riccardo left at some point for HFT consultancy pastures, and Emir left too a bit later, IIRC. The team started building v4 services. We could spin up empty Cloud Foundry clusters quite easily, but to get feature parity with v3 we needed:
  • Backup services (nightly, plus point-in-time recovery (PITR) as a high-end option)
  • Service brokers for S3 and RDS
  • Reaper / resumer for Free Apps
  • A Cloud Portal that could support both v3 and v4
  • A lot of improvements to the buildpack (client certificate support etc)
  • A logging service (+ live logging would be nice)
  • A monitoring service
  • A load balancer / SSL/TLS terminator
  • Dashboards for monitoring
That's a lot of things we had to build! We also needed to up our documentation game because we were really lagging behind there. I managed to poach Kasja as a technical writer and she started creating a lot of articles, as well as bringing a lot of fun to the team. For some reason the atmosphere in our office was better on the days that she was working. It's nice to have diverse teams and not only smart introverted nerds (she is also a smart introverted nerd, but that aside).

The plan to avoid the Second-system effect

By this time v3 was pretty mature, and I knew enough about the software industry to be dead-set on avoiding the Second-system effect. So: no new features in v4 except horizontal scaling, which we would get out of the box with Cloud Foundry. In fact, we would start with fewer features than v3, and begin by migrating the customers for whom v4's limited feature set was enough.

I figured we would launch v4 in four phases:
  1. Launch a v4 beta for Sandboxes / Free Apps without any guarantees. This way we could remove the hacky Sandboxes implementation in v3 and get some real world experience with running v4.

  2. Launch some small non-critical apps on v4.

  3. Make v4 the default for new apps (v4 unless you need VPNs, 99.99% uptime, etc.)

  4. Start phasing out v3 and move everything to v4.
I think I had the plan outlined in my head, but I didn't communicate it at all. People outside our team were really confused about what was happening and when, especially as all the other migrations from v2 were also still going on. It turns out project management is a skill, and I didn't have it yet. My approach was "It's done when it's done and today we are just trying to survive". I didn't put any concrete dates on launches, because in that chaotic environment it was hard to be predictable.

Struggles & Highlights

Nevertheless, the plan moved forward, though some stressful situations followed. Here are some highlights:

1) Making Free Apps work with the Web Modeler. The Free Apps cluster had to be made faster and compatible with the Web Modeler. Read about that in 2015 - InstaDeploy. Getting this more or less stable was a lot of work and had a lot of management eyes on it.

2) Sales team briefing. At some point I sat down with the Head of Marketing (Hans de V.) and we decided I should brief the sales teams on the new v4 system that was coming up. I expected some kind of dialogue, but instead I sat next to Hans (who is really great at giving presentations) and was handed the microphone in front of a Zoom call with 40 sales people. It was one of the absolute worst performances I've ever given. I had given plenty of mediocre presentations before, but there I could at least look the audience in the face and get some kind of feedback, which compensated for my chaotic presentation style a bit. Here, talking into the webcam and seeing only my own slides threw me off. These sorts of things require adaptability, preparation and experience, and I had none.

3) Latency issues. Our great new "High Availability" feature, where apps were hosted multi-AZ, turned out to be a performance disaster for some apps. We had customer apps where one API call would result in thousands of sequential database requests. The latency to the database could jump between 0.1 ms and 5 ms depending on the AZ, so after re-deploys the performance of apps would be dramatically different. Oops. This did not make our Customer Service and Professional Services teams happy.
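The arithmetic behind that disaster is worth spelling out. Using the rough figures from this post (not measurements), sequential round trips multiply the per-request latency:

```python
# Back-of-envelope: sequential DB round trips multiply per-request latency.
# The call count and latencies are the rough figures mentioned above.
calls = 1000          # sequential DB requests behind one API call
same_az_us = 100      # ~0.1 ms round trip when app and DB land in the same AZ
cross_az_us = 5000    # ~5 ms round trip when they land in different AZs

same_total_s = calls * same_az_us / 1_000_000   # 0.1 s of pure DB latency
cross_total_s = calls * cross_az_us / 1_000_000 # 5.0 s of pure DB latency
print(same_total_s, cross_total_s)
```

So the exact same app could feel 50x slower after a re-deploy, purely depending on which AZ the scheduler picked.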

4) Migrating 6000 apps single-handedly in one night. After a while I wanted to migrate all Free Apps from the first, outdated Cloud Foundry cluster to a new and more professional one. About 6000 apps needed their database, application code, DNS entries and files moved to another Cloud Foundry cluster. Doing this properly with the team, checklists and reviewed migration scripts would take weeks. I decided to do it by myself in one night with a Python script from my home, after a long working day. Keeping track of this huge migration hit the limits of my brain and the stress level during the migration was insane. It turned out well, but it was very stupid; something could easily have gone wrong.
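The actual script is long gone, so this is purely an illustrative sketch of the shape such a per-app migration loop takes; every step name here is a hypothetical stand-in, not the real implementation:

```python
# Hypothetical sketch of a bulk per-app migration loop.
# The four steps mirror what had to move: database, files, app code, DNS.

def migrate_app(app_id, source, target, log):
    """Move one app from the source cluster to the target cluster."""
    for step in ("dump/restore database", "copy files",
                 "push app code", "switch DNS"):
        log.append(f"{app_id}: {step} ({source} -> {target})")

def migrate_all(app_ids, source, target):
    """Migrate every app, collecting failures instead of aborting."""
    log, failed = [], []
    for app_id in app_ids:
        try:
            migrate_app(app_id, source, target, log)
        except Exception:
            failed.append(app_id)  # keep going; review failures afterwards
    return log, failed

log, failed = migrate_all(["app1", "app2"], "cf-old", "cf-new")
```

With 6000 apps and four steps each, that is roughly 24,000 operations to track, which is exactly why doing it solo, at night, without reviewed scripts was so risky.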

5) AWS costs. The costs at AWS were going through the roof. After a couple of weeks I discussed making up-front investments with the Head of Finance, and we had to transfer 300k EUR just for the first batch. Nowadays whenever I need anything from AWS NL, I mention that I was (to a large degree) responsible for getting Mendix on AWS, and they are always happy to help someone who landed them a huge account.

6) Road trip. Mendix R&D employees had a 1500 euro yearly training budget. Most people would not use it, or would go to a nearby boring conference and spend 500 on the ticket, 500 on travel, and 500 on food & hotel. I wanted something more epic. With about 8 colleagues we pooled all our budgets, obtained free conference tickets (because the company sponsored Cloud Foundry) and went on an epic 10-day trip: first by plane from Amsterdam to Las Vegas, then by fancy convertible to Death Valley, Yosemite, San Francisco, the conference in San Jose, on to Los Angeles and back to Vegas. I'll write another post about this in the future, but it was one of the greatest trips of my life. We also learned a lot about Cloud Foundry.

Wrapping up and moving on

In early 2018 I was getting ready to leave Mendix to become CTO at a tiny startup called easee. But by this time we had reached two important milestones, so I could leave without a sense of abandoning the project.

Some unsuspecting new customer was the first to go live on v4; the CSM team figured that this customer wouldn't need any of the v3 features and wouldn't run anything mission-critical any time soon. Then more new, non-critical customers joined. Around this time we also publicly announced Cloud v4 with a blog post that is now lost.

About half a year after that, we decided (together with CSM, Support and Professional Services) that all new apps would go to v4 by default!

This completed step 3 of my 4-step plan. It had taken years of discipline: not giving guarantees about VPNs, not promising NL-based hosting, promoting 12-factor architecture best practices. But now it was ready. Most apps could "just" be deployed to v4 and scaled horizontally, and we could scale the underlying infrastructure with just our credit card!

The verdict

Mendix Cloud v4 brought a lot of benefits:
  • Global deployment: we could offer locations in all AWS Regions within days, while before it would have been a huge pain to find a co-location provider, get approvals, buy hardware, fly there and set up the hardware. Let alone maintaining and growing it.

  • Standardization: we offered a standard set of functionality and therefore could operate the apps very well.

  • High availability: apps ran as multi-AZ 12-factor deployments combined with highly available AWS RDS and AWS S3. Compare this with single-VM apps that could easily be down for an hour in case of hardware failure.

  • App scalability: apps could scale horizontally and could be paired with gigantic RDS instances, something that we would have struggled with in v3 because of how tightly we packed apps onto physical machines.

  • Organizational scalability: we could upgrade clusters without application disruptions (so no need to align with all customers) and could shrink and grow our cloud without doubling our engineering team. App resizes were also largely self-service, reducing the load on Customer Service.
There were also some downsides:
  • Ultimately Cloud Foundry lost the OSS cloud war. In 2015/2016 we chose Cloud Foundry; Kubernetes was in its infancy and seemed overly complex and not enterprisey enough. Despite that, Kubernetes turned out to be the big winner and we were stuck with Cloud Foundry. For the end user it wouldn't matter, as Cloud Foundry is just the engine underneath Cloud v4 and it could be swapped out for Kubernetes without customers noticing the difference. It was a sane choice at the time and we couldn't have predicted the future.

  • It was expensive. I estimate that our infra costs were about 5 times higher on AWS than on our own hardware, because of how optimized our own setup was.
I think the pros far outweighed the cons.

To seasoned engineers this makes total sense, but just to re-iterate what worked well in the process:
  1. We kept the feature set of v4 as limited as possible to avoid the Second-system effect.

  2. We prepared for the fundamental architecture changes for years. E.g. adding horizontal scalability support to the runtime, and offloading storage to S3 instead of the local file system. These had to be planned long in advance, without any direct benefit.

  3. We didn't "shout" about v4 or commit publicly to a launch date until we were very close to ready.

  4. We kept the communication lines open between all departments: R&D, sales, CSM, Customer Service, CloudOps. All were actively involved in getting the first customers on v4.

  5. We always launched with non-critical applications first to gain experience.

What happened afterwards

After I left, the Mendix Cloud team (now called the Cloud Unit) grew to about 100 people. They continued improving the software, expanding the global footprint and offering dedicated Mendix Cloud regions for big customers. After the last apps on v3 were turned off in early 2023, I believe the v4 name was officially retired, but it's still there in the documentation URL: https://docs.mendix.com/developerportal/deploy/mxcloudv4/ . Note that there are 16 public regions now! In v3 we had just four regions, two of which were in The Netherlands. Also note that even 6 years after launching v4 there is still no v5: the "chaos" of all the previous non-standard versions has been solved, so there is no need for multiple different versions.

My personal take-away

For me personally, the crowning achievement was three applications that really needed the scale or fault-tolerance that Cloud v4 offered. The first was the PostNL Order Management System, which had a huge landscape that needed to be deployed with horizontally scaled applications. The second was an app for the Huishoudbeurs in 2018, a companion app for the fair which needed to scale rapidly to thousands of concurrent users for a couple of days. The final one was a Japanese customer who needed a solution hosted very near Japan for latency reasons. Luckily AWS was there, and we could spin up a new region within a couple of days. None of these could have worked on v3.

What makes me especially proud of the product is A) that it took years of disciplined, coordinated effort to make it possible B) that we managed to successfully host v3 and scale with the business needs year over year, while C) also spending just enough engineering capacity to build the much better solution with v4. The team was also really fantastic, I was having the time of my life with these guys.

We did all that with just 12 people.

I still remember this logo fondly

P.S. if you want your own "Mendix Cloud"-like hosting on Kubernetes ask my friend Xiwen who's building low-ops.com .

