Sunday, 7 October 2018

Code Should be Readable

On the first day of my first job, at age 23, I learned the most important lesson of my life about programming: code should be readable. Four years of CS at university did not prepare me for this, but one day of work did (thanks Achiel!).

I had created something "clever" and was barked at by the Senior Engineer of the team that my solution was too complicated. For a second I thought he was suffering from low IQ, but he explained that:

  1. He was not able to instantly understand my code. Lesson: It was apparently vital that other people in the team understood my code. In fact, it was not even my code, the team owned it.
  2. The problem I was solving was simple and not worth spending much mental capacity on. Lesson: writing code is just a small part of what happens with it, code is much more frequently read, tested, debugged and refactored. All these activities take time and mental capacity.
  3. In 6 months, I would have forgotten the code and would look at it the same way he did now (that is: with a frown and raised eyebrows). Lesson: You are temporarily suffering from understanding the code too well, make sure you compensate for that.
  4. We were paid (quite well) to make products for the company, not to be smart-asses. Lesson: time is money and there is something called Opportunity Cost that makes your time even more valuable. Boring code is good code.

These lessons have stuck with me forever and made me allergic to complicated code. Of course sometimes problems really are complex, and there is a big difference between "essential complexity" and "merely complicated". Senior Software Engineers should have a well-developed gut feeling for distinguishing between the two.

I don't care much for fanatic discussions about Test Driven Development or Micro Services vs. Monoliths because reality is much less clear-cut. I think it's MUCH more important that whenever you create software, or basically anything*, that you keep this in mind:
  • the complexity of the solution should match the complexity of the problem
  • what you create should be easy to understand for those who work with it

As long as you build your software by these two rules, you should be alright.




* read The Design of Everyday Things by Don Norman and Don't Make Me Think by Steve Krug

Necessary Evil

In one of the codebases I've worked with we had a module called "NecessaryEvil". All of the dirty hacks were put in there. This had three good outcomes:


  1. We could track our technical debt by looking at the size of the module (and the references to it)
  2. Every time you saw a reference to a method in NecessaryEvil, your brain made a shortcut and instantly reminded you of how the thing you were looking at "worked" (that is: via a dirty hack, or in a non-intuitive way). This alone saved a lot of time in debugging.
  3. Over time the framework we used got more features. Every now and then we looked at stuff in "NecessaryEvil" and found things that could now be solved properly. This was highly motivational to the developers, because (a) it gave them a sense of ownership and (b) they saw the progress of the framework we used.


I highly recommend making your "dirty hacks" explicit in your code base. Here is how to do it:

  1. Put all the dirty hacks in a directory / module / package where possible.
  2. Where #1 is not possible: add `dirty_hack_` or `DirtyHack` to the function or variable names. Don't put this in comments, because comments don't show up in references, call stacks or stack traces.
  3. Add a measurement to your CI infrastructure that counts the number of DirtyHacks and references to DirtyHacks in your codebase (a minimal sketch follows below).
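
For step 3, the measurement does not have to be fancy. Here is a minimal sketch in Python of what such a CI counter could look like; the marker pattern, the src directory and the optional budget argument are assumptions you would adapt to your own setup:

import pathlib
import re
import sys

# Markers that flag a dirty hack; adjust to your own naming convention.
MARKERS = re.compile(r"dirty_hack_|DirtyHack|NecessaryEvil")
SRC_DIR = pathlib.Path("src")  # hypothetical source root

def count_markers(root):
    total = 0
    for path in root.rglob("*.py"):  # adjust the glob to your language
        total += len(MARKERS.findall(path.read_text(errors="ignore")))
    return total

if __name__ == "__main__":
    count = count_markers(SRC_DIR)
    print(f"dirty hack references: {count}")
    # Optionally fail the build when the count grows past a budget.
    if len(sys.argv) > 1 and count > int(sys.argv[1]):
        sys.exit(f"too many dirty hacks: {count} > {sys.argv[1]}")

Run it as a CI step and plot the printed number over time to watch your technical debt shrink (or grow).
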
Happy hacking!

Every Day Shampoo

Naming things is hard. Especially names for things that you want to market. From personal experience I can say that it's a lot easier to name a kid than to name a company or a product, but maybe that's just me.

The BEST product name I've ever come across is "Every Day Shampoo". I've seen it only in the Netherlands, where it's called "Iedere dag" but it probably exists in other countries too. Bottles of "Every Day Shampoo" are surrounded by exotic shampoos called "Ocean Breeze", "Rainforest Fresh" or ones with feature focused names such as "Anti-frizz" and "Colour Lock".

Instead, "Every Day Shampoo" simply tells you that you can use it every - single - day! Before this product came to market, people used to wonder if it's healthy to shampoo every day. Now they no longer need to think about that! Or at least they don't if only they buy this shampoo right now! Because maybe the others are actually bad for your hair!?  I like to think about how much more shampoo this company has sold because of this brilliant name.

Sunday, 14 January 2018

S3 boto3 'StreamingBody' object has no attribute 'tell'

I was recently trying to work with the Python package warcio, feeding an S3 object from the Common Crawl bucket directly into it.

import boto3
from warcio.archiveiterator import ArchiveIterator
s3 = boto3.client('s3')
r = s3.get_object(Key='crawl-data/file....', Bucket='commoncrawl')
for record in ArchiveIterator(r['Body']):
    pass

However, this fails with the error:
self.offset = self.fh.tell()
AttributeError: 'StreamingBody' object has no attribute 'tell'

The reason is that the boto3 StreamingBody object doesn't implement tell(). It's easily fixed by creating a tiny wrapper class:

class S3ObjectWithTell:
    """Wraps a boto3 StreamingBody and tracks the read offset so tell() works."""

    def __init__(self, s3object):
        self.s3object = s3object
        self.offset = 0

    def read(self, amount=None):
        result = self.s3object.read(amount)
        self.offset += len(result)
        return result

    def close(self):
        self.s3object.close()

    def tell(self):
        return self.offset

You can now use this class and change
for record in ArchiveIterator(r['Body']):
into

for record in ArchiveIterator(S3ObjectWithTell(r['Body'])):

Sunday, 17 January 2016

Parsing 10TB of Metadata, 26M Domain Names and 1.4M SSL Certs for $10 on AWS

Last May I was working on a hobby project similar to this: https://github.com/zakjan/cert-chain-resolver/ . Since I found the cert-chain-resolver project a couple of days later, I did nothing with the results at the time, but I recently got some nice comments on an HN thread about how I used one VM to download & process 10TB in a couple of hours, so I decided to do a write-up on the process and publish the data.

See the parts below:

  • Part 1: downloading 10TB of metadata in 4 hours
  • Part 2: fetching a ****load of certificates
  • Part 3: playing with the data

My approach was somewhat different from the GitHub project above: instead of using the AIA extension, I wanted to brute-force the solution by finding all known intermediate and root certificates in advance. Based on a checksum of the issuer/subject fields I could look up which certificates "claimed" to be the signer of a certificate, and then use the signature to filter out which ones actually were. You can use my online service like this:

cat CERT.crt | curl -F myfile=@- -X POST https://certinator.cfapps.io/chain/

You then get the entire chain back on your prompt.
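
To make the matching step concrete, here is a minimal sketch of the idea in Python with the cryptography package. The function names and the candidates_by_subject index are my own illustration, and the verification call assumes RSA signatures; it is not the actual code behind the service:

import hashlib
from cryptography import x509
from cryptography.hazmat.primitives.asymmetric import padding

def name_checksum(name):
    # Checksum of a subject or issuer field, used as a lookup key.
    return hashlib.sha256(name.public_bytes()).hexdigest()

def find_issuer(cert, candidates_by_subject):
    # Step 1: find certificates that *claim* to be the signer, i.e. whose
    # subject checksum matches this certificate's issuer checksum.
    for candidate in candidates_by_subject.get(name_checksum(cert.issuer), []):
        # Step 2: check the signature to see which candidate actually signed it.
        try:
            candidate.public_key().verify(
                cert.signature,
                cert.tbs_certificate_bytes,
                padding.PKCS1v15(),              # assumes an RSA signature
                cert.signature_hash_algorithm,
            )
            return candidate
        except Exception:
            continue
    return None

Here candidates_by_subject would simply be a dictionary from name_checksum(candidate.subject) to the list of known intermediate and root certificates with that subject.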

Part 1: downloading 10TB of metadata in 4 hours

Brute-forcing always leads to interesting optimization challenges. In this case I needed to scrape as many certificates from the SNI-enabled web as I could, and for that I needed the longest list of domain names available on the Internet. There is the Alexa top 1 million, but I figured something better should exist. I could not find anything as a direct download, but I did stumble upon the Common Crawl project. They crawl a huge number of URLs and publish the results on S3. Luckily they split the metadata (headers etc.) from the actual data, and the URL is listed in the metadata. That could be filtered down to a list of all URLs, which in turn could be filtered down to a list of all domain names. A promising approach, since I don't need the entire data set; the metadata suffices.

Still, the amount of metadata for one scan is about 10TB. Downloading that from my home over a 60Mb connection would take about 15 days and would probably violate my ISP's Fair Use Policy.

However, the data is on S3, and AWS offers the c3.8xlarge: a 32-core instance with 10Gb networking and free data transfer from S3 within the same region. If the network is the bottleneck, that gives a theoretical processing time of roughly 2 hours (10TB is about 80Tb, which at 10Gb/s takes around 8,000 seconds), or about a 100x speedup over my home connection! The machine costs about $1.70 per hour, and I would not need Hadoop/MapReduce or any kind of clustering to get results.

Now I'm a big fan of using simple Unix commands to get complex tasks done. Using the "split" command I split the list of metadata files into chunks of about 100 files each. Each line of these chunks was then processed with a pipeline like this:

curl -s http://aws-publicdatasets.s3.amazonaws.com/$i | zgrep '^WARC-Target-URI:' | cut -d'/' -f 3 | cut -d'?' -f 1 | sort | uniq

This filters a 150MB gzipped file down to about 100KB of raw domain names. I used a bit of Python with gevent for job scheduling, but looking back "xargs -n 1 -P 50" would have done just as well. Even though each chunk result was only 100KB, after processing thousands of them the 8GB root partition was filling up, so I started "rsync --remove-source-files" in a loop to transfer the result files to my laptop. I could have stored the results on an EBS volume, but then after all the hard work was done I would have had to wait until the results (10GB) were downloaded to my laptop. Better to download them incrementally and throw away the $1.70/hour machine the minute I was done with it. Not that an extra $1.70 would break the bank, but I challenged myself to keep efficiency at a maximum.
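
The scheduling part was only a few lines. Here is a minimal sketch of what a gevent-based scheduler for this pipeline could look like; it is a reconstruction, not the original script, and the results/ directory is an assumption:

import os
import sys
from gevent.pool import Pool
from gevent import subprocess  # cooperative subprocess, so downloads overlap

PIPELINE = (
    "curl -s http://aws-publicdatasets.s3.amazonaws.com/{path} | "
    "zgrep '^WARC-Target-URI:' | cut -d'/' -f 3 | cut -d'?' -f 1 | sort | uniq"
)

def process(line):
    path = line.strip()
    # One metadata file -> one shell pipeline -> one small file of domain names.
    out = subprocess.check_output(PIPELINE.format(path=path), shell=True)
    with open("results/" + path.replace("/", "_"), "wb") as f:
        f.write(out)

if __name__ == "__main__":
    os.makedirs("results", exist_ok=True)
    Pool(30).map(process, sys.stdin)  # about 30 parallel downloads, see below
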

I used "nmon" to determine the optimal amount of parallellization, which turned out to be about 30 simultaneous downloads. The bottleneck turned out the be CPU rather than network, by a small margin. Unpacking the .gz files is expensive! The 30 gzip commands were at the top in "top". At 30 simultaneous downloads I was doing 95% CPU on 32 cores and about 5Gb of network traffic. Initially I launched the in the wrong region so latency to S3 was pretty high, that had a pretty big influence of performance but I didn't write down the exact numbers. For the record, the data set is hosted in us-east-1.

At my day job I regularly work with about 30 physical machines with 288GB RAM and 16 cores each, but I think for those 4 hours in May this "little" VM did more than all that infrastructure combined.

Part 2: fetching a ****load of certificates


So now I had a whole lot of files, each containing a whole lot of domain names, with many duplicates. I needed to get that down to one big list of unique domain names. I'm usually on the McIlroy side of Knuth vs. McIlroy, in that I prefer simple commands that get the job done. In this case, doing

cat segments/* | sort | uniq

would be preferable. The problem was that the original data set containing all the duplicates did not fit in my laptop's memory and "sort" needs to keep the whole thing in memory, so I solved it with a bit of Python and a marisa_trie, which removes the duplicates as a side effect of building the trie. Looking back, maybe "sort -u" has the same optimization? Anyway, I also learned that setting "LC_ALL=C" significantly speeds up "sort" (about 4x in my case) and can be used if you're not dealing with special characters.
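
The deduplication itself is only a handful of lines. A sketch of the marisa_trie approach, assuming the chunk results live in segments/ and the output goes to the domainnames file used below:

import glob
import marisa_trie

def iter_domains():
    # Stream every domain name from the chunk results, duplicates and all.
    for filename in glob.glob("segments/*"):
        with open(filename, encoding="utf-8", errors="ignore") as f:
            for line in f:
                yield line.strip()

# Building the trie deduplicates the keys; only the compact trie is kept
# in memory, not the duplicate-heavy raw data.
trie = marisa_trie.Trie(iter_domains())

with open("domainnames", "w", encoding="utf-8") as out:
    for domain in trie.iterkeys():
        out.write(domain + "\n")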

So now I had 26 million domain names. How do we get the X509 certificates from those servers? I created a small program in Go that does just that.

I executed it with

cat domainnames | go run fetch.go

This ran for a couple of hours on my laptop with quite low CPU and network usage. It does nothing but an SNI-enabled SSL/TLS handshake per domain and stores each certificate under a filename that is its sha1 sum.
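
In essence the program does something like the following for each domain. This is a rough Python sketch of the same idea, not the actual Go code, and the certs/ output directory is an assumption:

import hashlib
import os
import socket
import ssl
import sys

# Accept whatever certificate the server presents; we only want to store it.
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

def fetch_cert(domain, timeout=5):
    # SNI-enabled TLS handshake; server_hostname selects the right certificate.
    with socket.create_connection((domain, 443), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=domain) as tls:
            return tls.getpeercert(binary_form=True)  # DER bytes

os.makedirs("certs", exist_ok=True)
for line in sys.stdin:
    domain = line.strip()
    try:
        der = fetch_cert(domain)
    except (OSError, UnicodeError):
        continue  # unreachable host, no TLS, handshake failure, ...
    with open("certs/" + hashlib.sha1(der).hexdigest(), "w") as f:
        f.write(ssl.DER_cert_to_PEM_cert(der))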

Now I had about 1.4 million X509 certificates.

Part 3: playing with the data


Here are the links for:
- archive of certificates (tar.gz torrent)
- archive of domain names (.gz torrent)

The following program prints hashes of the public key field, the entire certificate, the subject and the issuer.
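
The original is a small Go program (invoked below). A rough Python equivalent could look like this; it assumes the archive contains PEM-encoded certificates and prints one line of four hashes per certificate, which is my own guess at the output format:

import hashlib
import sys
from cryptography import x509
from cryptography.hazmat.primitives import serialization

def sha1(data):
    return hashlib.sha1(data).hexdigest()

# Read a concatenated stream of PEM certificates from stdin,
# roughly what "tar -zxOf certificates.tar.gz" would produce.
for blob in sys.stdin.buffer.read().split(b"-----END CERTIFICATE-----"):
    if b"-----BEGIN CERTIFICATE-----" not in blob:
        continue
    cert = x509.load_pem_x509_certificate(blob + b"-----END CERTIFICATE-----\n")
    pubkey_der = cert.public_key().public_bytes(
        serialization.Encoding.DER,
        serialization.PublicFormat.SubjectPublicKeyInfo,
    )
    # One line per certificate: public key hash, certificate hash,
    # subject hash, issuer hash.
    print(sha1(pubkey_der),
          sha1(cert.public_bytes(serialization.Encoding.DER)),
          sha1(cert.subject.public_bytes()),
          sha1(cert.issuer.public_bytes()))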



Using the output of

tar -zxOf certificates.tar.gz | go run print-certs.go > hashes

I was able to find some nice things:

  • People reusing the same public/private key across multiple certificates. Some guy in Russia did this 174 times. (sort hashes | uniq -w 30 -c -d | sort -n | tail)
  • Top 3 CAs are 3: RapidSSL CA, 2: COMODO RSA Domain Validation Secure Server CA, 1: Go Daddy Secure Certificate Authority - G2. (cat hashes | cut -f 4 -d ' ' | sort | uniq -c | sort -n | tail)
  • 59693 self-signed certificates with subject C=US, ST=Virginia, L=Herndon, O=Parallels, OU=Parallels Panel, CN=Parallels Panel/emailAddress=info@parallels.com
You could use the data set to scan for insecure public keys (< 1024 bits in length), outdated certificates, which dates are the most popular for creating certificates, or how long before expiration people tend to renew their certificates, to name a few ideas.
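
As an example, flagging weak RSA keys takes only a few lines with the cryptography package; this is a sketch, and the certs/ path, the PEM assumption and the RSA-only check are illustrative choices:

import glob
from cryptography import x509
from cryptography.hazmat.primitives.asymmetric import rsa

weak = []
for path in glob.glob("certs/*"):
    with open(path, "rb") as f:
        cert = x509.load_pem_x509_certificate(f.read())
    key = cert.public_key()
    # Flag RSA keys shorter than 1024 bits (the insecure range mentioned above).
    if isinstance(key, rsa.RSAPublicKey) and key.key_size < 1024:
        weak.append((path, key.key_size))

print(len(weak), "certificates with weak RSA keys")
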
I managed to design, run, and process the results of this experiment in one weekend. The total AWS cost was just under $10 for running the $1.70/h machines for 5 hours. Getting data out of AWS is normally pretty expensive, but since I was only getting data into my instance, the network costs were a couple of cents at most. The rest of the experiment could easily run on my laptop and home Internet connection. All in all I think this is a great example of using AWS for CPU/network-intensive experiments, and I certainly got my money's worth of computing power!

If you have any questions or ideas for improvement, I'd like to hear them in the comments on Hacker News!

Tuesday, 23 June 2015

Why you and I will never have a 15 hour work week

tldr: time to market.

Recently Keynes' prediction of a 15 hour work week has resurfaced in discussions ranging from automation to basic income. Why exactly do we still work 40+ hours? I've seen many reasons*, but I'd like to focus on something I missed in all the articles and comments: time to market. Companies need their creative employees to deliver at peak performance to achieve the fastest time to market. How can they invent & ship their products as soon as possible? By sticking to a small, stable team that works about 40 hours per week per person.

There are two reasons for this:
  1. In creative professions the individuals are not easily interchangeable. You can have a different bricklayer continue your part of a wall the next day, getting another programmer to continue your piece of code is much more difficult. The same people have to work on a project for sustained periods of time.
  2. In the last century it has been demonstrated that you get the most out of employees when they work close to 40 hours per week.

So in conclusion: I don't think that innovating companies that want to stay competitive will ever let their creative people have a 15 hour work week. Nothing stops the employees from retiring at 40 though ;)

discuss on HN


* These are the reasons I've seen:

  • Wealth inequality: the top X % are reaping the benefits, the rest don't have the wealth to work less
  • Work as leisure: people enjoy work more than their free time
  • Bullsh*t jobs: we invent unimportant work to keep busy
  • Consumerism: we want more and higher quality things

Wednesday, 21 May 2014

Browsers Should Shame ISPs for not Providing IPv6

tldr;
I propose something like in this image, so that browsers put some pressure on ISPs and web hosts to adopt ipv6.




For years we have heard that the Internet is about to break because the world is running out of ipv4 addresses. Sad, but not true. In my career as a programmer I've seen dirty hacks grow into extremely elaborate systems, despite the fact that they were originally set up to solve only moderately difficult problems. NATting is certainly a dirty hack, and now that ipv4 addresses are running out we're going to see lots of multi-level NAT gateways, which are certainly elaborate systems if they need to provide high availability.

The problem is that these elaborate systems are difficult to maintain, and with each additional change you sink deeper into the quicksand. So even though I believe engineers will be able to work around the problems in ever dirtier ways it would be really good for the Internet if we could throw away the old system and switch to ipv6 en masse.

Who can help us escape from the quicksand that is ipv4? The Government? A plane? Superman? No. Browsers.

The Internet is in a catch-22 situation concerning the ipv6 switch. Consumers don't feel the need to switch as long as they can still use ipv4, and producers don't want to invest in ipv6 while all consumers still support ipv4. We should keep in mind that the general public is completely unaware of this issue. When I tell my fellow programmers that I've set up an ipv6 tunnel at home, they don't say "wow that's amazing, can you help me set one up too?". No, they say "cool story bro", shrug and walk away. In a way that's logical as the end-user gains nothing by the switch, everything worked just fine in the dirty-hack system. The only winners are the Internet engineers. But if this is how programmers who know about the issue respond, then what does the average consumer know about ipv6? Nothing.

Now what does everyone, including the average consumer, use to connect to the Internet? That's right: browsers.

What if your browser displayed a green endorsement like this (I know, I'm not a designer).



And if your ISP (or the server) is stupid:

The information block should also contain a link to a site explaining why ipv6 is good for the Internet. I know that if I were a CEO, I would not like customers to see this on my company's website.

Good idea? Bad idea? Discuss on HN.