Monday, 11 February 2019

UUIDs in MySQL follow up

Last week my blog post on UUIDs in MySQL stirred a discussion on Hacker News.

I wrote the post in about half an hour and, stupidly enough, did not think about UUID v1, which is what MySQL's UUID() generates. I knew about v1 (from the Melissa virus case), v4, and v3 & v5 (which I once used to generate deterministic UUIDs). I just didn't link my case to UUIDv1 because I've hard-wired UUID to UUIDv4 in my head.

Here are two observations from the discussion on HN.

First: over time, the first characters of UUIDv1 are still uniformly distributed.



The only reason I had the same prefix was because I filled in all values with one UPDATE query when I initialized the column. If on the other hand you use UUID() whenever you create a new row, the first characters will be distributed evenly.
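A small sketch of why this happens, assuming the standard version 1 layout in which the first eight hex characters are the low 32 bits of a timestamp counted in 100 ns ticks. It takes the timestamp from one of the UUIDs in the original post and compares a simulated bulk UPDATE with rows created over a longer period. The spacing inside the batch and the 97 second interval between inserts are made-up numbers for illustration.

```python
import uuid
from collections import Counter

TICKS_PER_SECOND = 10_000_000  # version 1 timestamps count 100 ns intervals

# One of the UUIDs from the original post; .time is its 60-bit timestamp.
sample = uuid.UUID("4331cb9e-1d91-11e9-be2c-45923c63e8a2")


def first_char(timestamp: int) -> str:
    """First hex character of a version 1 UUID with the given timestamp."""
    time_low = timestamp & 0xFFFFFFFF  # the first 8 hex characters
    return f"{time_low:08x}"[0]


# Assumption: one UPDATE fills 1000 rows, 100 ticks (10 microseconds) apart.
batch = Counter(first_char(sample.time + i * 100) for i in range(1000))

# Assumption: 1000 rows inserted one at a time, 97 seconds apart.
spread = Counter(
    first_char(sample.time + i * 97 * TICKS_PER_SECOND) for i in range(1000)
)

print(dict(batch))             # {'4': 1000}
print(sorted(spread.items()))  # every hex character shows up
```

The first character only changes every 2^28 ticks (about 27 seconds) and cycles through all sixteen values in roughly 7 minutes, so a single bulk update lands on one prefix while ordinary inserts spread out.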

Second: the different UUID versions should be more clearly distinguished in general. The Zen of Python has a line:
"Explicit is better than implicit."
In following this, I think it's better not to expose any UUID() function, but only explicitly offer the typed versions UUIDv1(), UUIDv4() etc. This is already what most libraries do as far as I've seen.
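As an illustration, Python's standard uuid module works this way: there is no bare uuid() function, only versioned generators such as uuid1() and uuid4(), and every UUID object reports its version. The sketch below also applies that check to two values from the table in the original post; the variable names are made up for this example.

```python
import uuid

# Generators are named after the version they produce.
time_based = uuid.uuid1()
random_based = uuid.uuid4()
print(time_based.version, random_based.version)  # 1 4

# Two values copied from the table in the original post.
sequential_looking = uuid.UUID("4331cb9e-1d91-11e9-be2c-45923c63e8a2")
random_looking = uuid.UUID("4f95de28-0fd1-48db-ad2e-34ecd169c483")
print(sequential_looking.version)  # 1
print(random_looking.version)      # 4
```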

Tuesday, 5 February 2019

UUIDs in MySQL are really not random

Universally Unique Identifiers (UUIDs) are great. I love how you can tell the progress of a batch job just by looking at the current UUID. If it starts with 0..., the task is less than 1/16th done. If it starts with 7d.., we're almost halfway there. At ff... we are nearing the end. The fact that you can tell this rests on two principles: 1) you sort your jobs by their uuid and 2) UUIDs are random, as in, distributed uniformly.
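The estimate can be written down in a few lines. This is only a sketch: it assumes the job IDs are random (version 4) UUIDs and that jobs are processed in sorted order. The function name progress_from_uuid and the example UUID are made up for illustration.

```python
import uuid


def progress_from_uuid(current: str) -> float:
    """Fraction of the UUID space that sorts before `current`."""
    return uuid.UUID(current).int / 2**128


# Hypothetical UUID of the job that is currently being processed.
print(f"{progress_from_uuid('7d3c1f0a-5b6e-4c2d-9a1b-0123456789ab'):.1%}")  # 48.9%
```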

However, last week, I noticed a strange thing: a clearly visible pattern in the uuid column of a database table. It should be impossible, but there it was. It looked like this:

> SELECT uuid FROM example ORDER BY id;
4f95de28-0fd1-48db-ad2e-34ecd169c483
4331cb9e-1d91-11e9-be2c-45923c63e8a2
4331cc4c-1d91-11e9-be2c-45923c63e8a2
4331ccec-1d91-11e9-be2c-45923c63e8a2
4331cd7e-1d91-11e9-be2c-45923c63e8a2
c7e2f124-f6ba-4434-843f-89958a7436ec
4331ce10-1d91-11e9-be2c-45923c63e8a2
4331ce9e-1d91-11e9-be2c-45923c63e8a2
4331cf28-1d91-11e9-be2c-45923c63e8a2
4331cfaf-1d91-11e9-be2c-45923c63e8a2
4331d017-1d91-11e9-be2c-45923c63e8a2
4331d0c6-1d91-11e9-be2c-45923c63e8a2
4331d139-1d91-11e9-be2c-45923c63e8a2
4331d1a7-1d91-11e9-be2c-45923c63e8a2
4331d20e-1d91-11e9-be2c-45923c63e8a2
4331d271-1d91-11e9-be2c-45923c63e8a2
4331d2d7-1d91-11e9-be2c-45923c63e8a2
3e18f8dd-b1d3-4e16-8a81-4bdceac91772
4331d33a-1d91-11e9-be2c-45923c63e8a2
f6b8658d-846b-4418-a79c-713db516203e
4331d3a8-1d91-11e9-be2c-45923c63e8a2
4331d40d-1d91-11e9-be2c-45923c63e8a2
4331d48e-1d91-11e9-be2c-45923c63e8a2
4331d4fe-1d91-11e9-be2c-45923c63e8a2
4331d567-1d91-11e9-be2c-45923c63e8a2
4331d5cb-1d91-11e9-be2c-45923c63e8a2
4331d630-1d91-11e9-be2c-45923c63e8a2
4331d691-1d91-11e9-be2c-45923c63e8a2
14a5cb4c-f336-4240-aff3-4ffcfa8d135f
4331d6f9-1d91-11e9-be2c-45923c63e8a2
4331d762-1d91-11e9-be2c-45923c63e8a2
4331d7c8-1d91-11e9-be2c-45923c63e8a2
4331d83d-1d91-11e9-be2c-45923c63e8a2



Hm, that is very weird. Did we maybe convert an old auto-incrementing integer column into UUIDs in a very stupid way? Did we maybe use UUID version 3 or 5? Did our library corrupt the first bits of the binary representation of the UUID? After a while, I remembered that we initialized this column like so:

UPDATE example SET uuid = UUID() WHERE uuid IS NULL;

I also remembered reading this in the MySQL documentation:
"Warning: Although UUID() values are intended to be unique, they are not necessarily unguessable or unpredictable. If unpredictability is required, UUID values should be generated some other way." 
If you are like me, you won't use UUID() for secrets after reading this (and I didn't!). If you are even more like me, you will think that this is like the difference between /dev/urandom and /dev/random and that the uniform distribution law still applies here. However, to my great surprise, the UUIDs generated by UUID() are not uniform at all! The result is that a significant portion of the UUIDs in our database start with the same character:

> SELECT COUNT(*), LEFT(uuid, 1) FROM example GROUP BY LEFT(uuid, 1);
+----------+---------------+
| COUNT(*) | LEFT(uuid, 1) |
+----------+---------------+
| 1943     | 0             |
| 1871     | 1             |
| 1913     | 2             |
| 1843     | 3             |
| 3050     | 4             |
| 1943     | 5             |
| 1889     | 6             |
| 1866     | 7             |
| 1865     | 8             |
| 1903     | 9             |
| 1868     | a             |
| 1898     | b             |
| 1854     | c             |
| 1897     | d             |
| 1941     | e             |
| 1836     | f             |
+----------+---------------+
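One way to put a number on "not uniform" is Pearson's chi-squared statistic over the counts in the table above. This is a sketch rather than a full statistical analysis; the critical value is the standard table value for 15 degrees of freedom at p = 0.001.

```python
# Counts per first hex character, copied from the query result above.
counts = {
    "0": 1943, "1": 1871, "2": 1913, "3": 1843,
    "4": 3050, "5": 1943, "6": 1889, "7": 1866,
    "8": 1865, "9": 1903, "a": 1868, "b": 1898,
    "c": 1854, "d": 1897, "e": 1941, "f": 1836,
}

total = sum(counts.values())
expected = total / len(counts)  # count per character if the prefix were uniform

# Pearson's chi-squared statistic: sum of (observed - expected)^2 / expected.
chi_squared = sum((n - expected) ** 2 / expected for n in counts.values())

CRITICAL_VALUE = 37.697  # 15 degrees of freedom, p = 0.001

worst = max(counts, key=counts.get)
print(f"expected per character: {expected:.2f}")
print(f"chi-squared: {chi_squared:.1f} (critical value {CRITICAL_VALUE})")
print(f"most over-represented first character: {worst}")
```

A statistic far above the critical value means the first character is very unlikely to be uniformly distributed, and almost all of the excess comes from the rows starting with 4.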


So, the lesson for today is: take warnings in documentation seriously. If you used UUID() for data that is supposed to be secret (like passwords), you have a serious problem, as these secrets can now be easily guessed.

Edit: read this follow-up post

Saturday, 26 January 2019

Highly Productive Teams and their Speed Bumps

In Civilisation, the fantastic BBC series from the 60's, Kenneth Clark is puzzled by the short burst of time in which cathedrals, spiritual movements and art styles are created. For a couple of decades (which sounds long by our standards), there seem to be surges in productivity, creativity and energy in almost the entire population. More was accomplished in these bursts than in the centuries of relative inactivity before or after.

In my experience, product development is very comparable. When looking back at some of the most important software projects in which I've played a part, there were bursts of just a couple of days where we were incredibly productive. We were energized, worked as one and were in a state of flow most of the time. The best part is this: although the experiences are very intense, they don't leave you very tired. Instead, afterwards you feel energized and you realize that you just did your best work! As a CTO, I would love my team to work like this!

Is it possible to achieve this high productivity reliably? I often fear that our competitors are consistently operating at this breakneck speed, but actually I suspect that there are very, very few teams in the world that get close. So, perhaps it is simply not possible. Maybe my memory is playing tricks on me, maybe I'm just nostalgic, or maybe it's another case of the 80% / 20% law, where 80% of the work gets done in 20% of the time and the finishing touches are actually the hardest and most boring to get right. Nevertheless, I want to learn something from these high productivity moments and create the right environment for my team to experience them.

These are the common denominators of the high-productivity bursts I've experienced, either when working solo or with a team:
  • There is a clear goal: build X or solve problem Y.
  • Everyone should be well-rested (as in, not tired or hungover).
  • The team consists just of people with a can-do attitude, no negativity. One sour face severely impacts the rest of the team.
  • The members don't have to be close personal friends, but they absolutely need to respect each other's skills.
  • Autonomy: everything that is needed to accomplish the project should be in one room and focused on the project. If more is needed, make sure that it's just one phone call away.
  • Everyone in the room can add something. If you can't, pair up with someone or get the team lunch or coffee; don't go browsing reddit like a zombie.
  • There is an organizer / tie-breaker that people look to when they're stuck or have questions.
  • Leadership that acknowledges the importance of the project. Together with respect for each other's skills, this provides the necessary Psychological Safety to the team.
  • A small team of max 5 people. Coordination and keeping everyone focused becomes too hard when the team is bigger.
There are also "speed bumps" that make a high productivity environment impossible.
  • No clear mission.
  • Negative team members. Although there is lots of value in criticism and being cautious, especially when dealing with high performance or security requirements, voice these concerns in a positive way, or, when the project is just a Proof of Concept, say that we'll tackle it when going to production.
  • People that like talking about the project more than actually getting to work and building it.
  • External dependencies outside of the room that are not immediately available, such as designers, the legal department or a product owner. Make sure everyone is available to the team. If you still have external dependencies, it can mean two things: 1) the goal or scope is actually not clear to the team or 2) the project doesn't have the backing of leadership.
  • No clear organizer, team members staring at each other waiting for others to make a move.
The most critical speed bump is "external dependencies". As soon as the creative process requires something that is not instantly available, productivity takes a huge hit. This is why productivity is so hard in larger organizations, where the number of stakeholders is very large.

Sunday, 7 October 2018

Code Should be Readable

On the first day of my first job, at age 23, I learned the most important lesson of my life about programming: code should be readable. Four years of CS at university did not prepare me for this, but one day of work did (thanks Achiel!).

I had created something "clever" and was barked at by the Senior Engineer of the team that my solution was too complicated. For a second I thought he was suffering from low IQ, but he explained that:

  1. He was not able to instantly understand my code. Lesson: It was apparently vital that other people in the team understood my code. In fact, it was not even my code, the team owned it.
  2. The problem I was solving was simple and not worth spending much mental capacity on. Lesson: writing code is just a small part of what happens with it, code is much more frequently read, tested, debugged and refactored. All these activities take time and mental capacity.
  3. In 6 months, I would have forgotten the code and would look at it the same way he did now (that is: with a frown and raised eyebrows). Lesson: You are temporarily suffering from understanding the code too well, make sure you compensate for that.
  4. We were paid (quite well) to make product for the company, not to be smart-asses. Lesson: time is money and there is something called Opportunity Cost that makes your time even more valuable. Boring code is good code.
These lessons have stuck with me forever and made me allergic to complicated code. Of course sometimes problems really are complex, and there is a big difference between "essential complexity" and "merely complicated". Senior Software Engineers should have a well-developed gut feeling to distinguish between the two.

I don't care much for fanatic discussions about Test Driven Development or Micro Services vs. Monoliths because reality is much less clear-cut. I think it's MUCH more important that whenever you create software, or basically anything*, that you keep this in mind:
  • the complexity of the solution should match the complexity of the problem
  • what you create should be easy to understand for those who work with it

As long as you build your software by these two rules, you should be alright.




* read The Design of Everyday Things by Don Norman and Don't Make Me Think by Steve Krug

Necessary Evil

In one of the codebases I've worked with we had a module called "NecessaryEvil". All of the dirty hacks were put in there. This had three good outcomes:


  1. We could track our technical debt by looking at the size of the module (and the references to it)
  2. Every time you saw a reference to a method in NecessaryEvil, your brain made a shortcut and instantly reminded you of how the thing you were looking at "worked" (that is: via a dirty hack, or in a non-intuitive way). This alone saved a lot of time in debugging.
  3. Over time the framework we used got more features. Every now and then we looked at stuff in "NecessaryEvil" and found things that could now be solved properly. This was highly motivational to the developers, because (a) it gave them a sense of ownership and (b) they saw the progress of the framework we used.


I highly recommend making your "dirty hacks" explicit in your code base. Here is how you should do it:

  1. Put all the dirty hacks in a directory / module / package where possible.
  2. Where #1 is not possible: add `dirty_hack_` or `DirtyHack` in the function or variable names. Don't put this in the comments, because comments don't show up in references, call stacks or stacktraces.
  3. Add a measurement to your CI infrastructure that counts the number of DirtyHacks and references to DirtyHacks in your codebase.
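A minimal sketch of such a measurement, assuming a Python code base and the `dirty_hack_` / `DirtyHack` naming from step 2. The function names, the file suffix filter and the way the number is reported are all made up for this example; a real CI job would publish the count to whatever metrics system you use.

```python
import re
from pathlib import Path

# Matches both naming conventions from step 2.
HACK_PATTERN = re.compile(r"dirty_hack_|DirtyHack")


def count_hacks_in_text(text: str) -> int:
    """Number of dirty hack definitions and references in one source text."""
    return len(HACK_PATTERN.findall(text))


def count_dirty_hacks(root: str, suffixes=(".py",)) -> int:
    """Total number of matches in all source files below `root`."""
    total = 0
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(encoding="utf-8", errors="ignore")
            total += count_hacks_in_text(text)
    return total


example = "def dirty_hack_fix_timezone(): ...\ncache = DirtyHackCache()\n"
print(count_hacks_in_text(example))  # 2
```

Calling count_dirty_hacks(".") from the repository root on every build gives a number you can plot over time.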
Happy hacking!

Every Day Shampoo

Naming things is hard. Especially names for things that you want to market. From personal experience I can say that it's a lot easier to name a kid than to name a company or a product, but maybe that's just me.

The BEST product name I've ever come across is "Every Day Shampoo". I've seen it only in the Netherlands, where it's called "Iedere dag" but it probably exists in other countries too. Bottles of "Every Day Shampoo" are surrounded by exotic shampoos called "Ocean Breeze", "Rainforest Fresh" or ones with feature focused names such as "Anti-frizz" and "Colour Lock".

Instead, "Every Day Shampoo" simply tells you that you can use it every - single - day! Before this product came to market, people used to wonder if it's healthy to shampoo every day. Now they no longer need to think about that! Or at least they don't if only they buy this shampoo right now! Because maybe the others are actually bad for your hair!?  I like to think about how much more shampoo this company has sold because of this brilliant name.

Sunday, 14 January 2018

S3 boto3 'StreamingBody' object has no attribute 'tell'

I was recently trying to work with the Python package warcio and feed an S3 object from the Common Crawl bucket directly into it.

import boto3
from warcio.archiveiterator import ArchiveIterator
s3 = boto3.client('s3')
r = s3.get_object(Key='crawl-data/file....', Bucket='commoncrawl')
for record in ArchiveIterator(r['Body']):
    pass

However, this fails with the error:
self.offset = self.fh.tell()
AttributeError: 'StreamingBody' object has no attribute 'tell'

The reason is that the StreamingBody object that boto3 returns doesn't support tell(). It's easily fixable by creating a tiny wrapper class that counts the bytes it has read:

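class S3ObjectWithTell:
    """Wraps a boto3 StreamingBody and tracks how many bytes have been read."""

    def __init__(self, s3object):
        self.s3object = s3object
        self.offset = 0  # number of bytes read so far

    def read(self, amount=None):
        result = self.s3object.read(amount)
        self.offset += len(result)
        return result

    def close(self):
        self.s3object.close()

    def tell(self):
        # The stream is only read sequentially, so the position is the byte count.
        return self.offset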

You can now use this class and change
for record in ArchiveIterator(r['Body']):
into

for record in ArchiveIterator(S3ObjectWithTell(r['Body'])):
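To check the wrapper without touching S3, you can feed it a stand-in stream. In the sketch below, NoTellStream is a made-up class that mimics the relevant behaviour (read and close, but no tell); the wrapper class is repeated so the example runs on its own.

```python
import io


class NoTellStream:
    """Hypothetical stand-in for a stream that has read() and close() only."""

    def __init__(self, data: bytes):
        self._buffer = io.BytesIO(data)

    def read(self, amount=None):
        return self._buffer.read(amount)

    def close(self):
        self._buffer.close()


class S3ObjectWithTell:
    """Same wrapper as above: counts the bytes that pass through read()."""

    def __init__(self, s3object):
        self.s3object = s3object
        self.offset = 0

    def read(self, amount=None):
        result = self.s3object.read(amount)
        self.offset += len(result)
        return result

    def close(self):
        self.s3object.close()

    def tell(self):
        return self.offset


payload = b"WARC/1.0\r\nWARC-Type: warcinfo\r\n"
wrapped = S3ObjectWithTell(NoTellStream(payload))

print(hasattr(NoTellStream(payload), "tell"))  # False
print(wrapped.tell())                          # 0
first = wrapped.read(8)
print(first, wrapped.tell())                   # b'WARC/1.0' 8
rest = wrapped.read()
print(wrapped.tell() == len(payload))          # True
wrapped.close()
```

Note that the wrapper only supports sequential reading; it does not add seek().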