Sunday, 14 January 2018

S3 boto3 'StreamingBody' object has no attribute 'tell'

I was recently trying to work with the python package warcio and feeding an s3 object from the common crawl bucket directly into it.

r = s3.get_object(Key='crawl-data/file....', Bucket='commoncrawl')
for record in ArchiveIterator(r['Body']):

However, this fails with the error:
self.offset = self.fh.tell()
AttributeError: 'StreamingBody' object has no attribute 'tell'

The reason is that boto3 s3 objects don't implement tell(). It's easily fixable by creating a tiny wrapper class:

class S3ObjectWithTell:
    def __init__(self, s3object):
        self.s3object = s3object
        self.offset = 0

    def read(self, amount=None):
        result = self.s3object.read(amount)
        self.offset += len(result)
        return result

    def close(self):
        return self.s3object.close()

    def tell(self):
        return self.offset

You can now use this class and change

for record in ArchiveIterator(r['Body']):

into

for record in ArchiveIterator(S3ObjectWithTell(r['Body'])):
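The wrapper can be sanity-checked without AWS by feeding it an io.BytesIO, which exposes the same read() interface as boto3's StreamingBody (the class is repeated here so the snippet is self-contained; the sample bytes are made up):

```python
import io

class S3ObjectWithTell:
    def __init__(self, s3object):
        self.s3object = s3object
        self.offset = 0

    def read(self, amount=None):
        result = self.s3object.read(amount)
        self.offset += len(result)
        return result

    def close(self):
        return self.s3object.close()

    def tell(self):
        return self.offset

# BytesIO stands in for the streaming body
body = S3ObjectWithTell(io.BytesIO(b"WARC/1.0\r\n"))
body.read(4)
print(body.tell())  # → 4
```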

Sunday, 17 January 2016

Parsing 10TB of Metadata, 26M Domain Names and 1.4M SSL Certs for $10 on AWS

Last May I was working on a hobby project similar to this: . As I found the cert-chain-resolver project a couple of days later, I did nothing with the results at the time. But I recently got some nice comments on this HN thread about how I used one VM to download and process 10TB in a couple of hours, so I decided to write up the process and publish the data.

See the parts below:

My approach was somewhat different from the github project above: instead of using the AIA extension, I wanted to brute-force the solution by finding all known intermediate and root certificates in advance. Based on the checksum of the issuer/subject fields I could look up which certificates "claimed" to be the signer of a certificate, and then use the signature to filter out which ones actually were. You can use my online service like this:

cat CERT.crt | curl -F myfile=@- -X POST

You then get the entire chain back on your prompt.

Part 1: downloading 10TB of metadata in 4 hours

Brute-forcing always leads to interesting optimization challenges. In this case I needed to scrape as many certificates from the SNI-enabled web as I could, and for that I needed the longest list of domain names I could find on the Internet. There is the Alexa top 1 million, but I figured something better should be available. I could not find anything as a direct download, but I did stumble upon the Common Crawl project. They crawl a huge number of urls and publish the results on S3. Luckily they split the metadata (headers etc.) from the actual data, and the url is listed in the metadata. This could be filtered down to a list of all urls, and then down to a list of all domain names. A promising approach, since I didn't need the entire data set; the metadata alone would suffice.

Still, the metadata of one scan amounts to about 10TB. Downloading that over my 60Mb home connection would take about 15 days and would probably violate the Fair Use Policy of my ISP.

However, the data is on S3, and AWS offers the c3.8xlarge: a 32-core instance with 10Gb networking and free data transfer from S3. This leads to a theoretical processing time of 2 hours if the network is the bottleneck, or about a 100x speedup over my home connection! The machine costs about $1.70 per hour, and I would not need hadoop/map-reduce or any type of clustering to get results.
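The back-of-the-envelope numbers work out like this (decimal units assumed; a rough sketch, not an exact benchmark):

```python
# 10 TB of metadata expressed in gigabits
total_gigabits = 10 * 1000 * 8  # 80,000 Gb

# over the instance's 10 Gb/s NIC, if the network is the only bottleneck
ec2_hours = total_gigabits / 10 / 3600
print(round(ec2_hours, 1))  # → 2.2

# the same transfer over a 60 Mb/s home connection
home_days = total_gigabits * 1000 / 60 / 86400
print(round(home_days, 1))  # → 15.4
```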

Now I'm a big fan of using simple unix commands to get complex tasks done. Using the "split" command I split the list of metadata files into chunks of about 100 files each. Each line of these chunks was processed with a pipeline like this:

curl -s$i | zgrep '^WARC-Target-URI:' | cut -d'/' -f 3 | cut -d'?' -f 1 | sort | uniq

This filters a 150MB gzipped file down to about 100KB of raw domain names. I used a bit of python with gevent for job scheduling, but looking back "xargs -n 1 -P 50" would have done just as well. Even though the chunk results were only 100KB each, after processing thousands of them the 8GB root partition was filling up, so I started "rsync --remove-source-files" in a loop to transfer the result files to my laptop. I could have stored the results on an EBS volume, but then after all the hard work was done I would have had to wait until the results (10GB) were downloaded to my laptop. Better to download them incrementally and throw away the $1.70/hour machine the minute I was done with it. Not that an extra $1.70 would break the bank, but I challenged myself to keep the efficiency at a maximum.
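The gevent scheduling boiled down to a bounded worker pool; a minimal stdlib stand-in looks like this (the function name and the chunk-processing callback are hypothetical, not the original script):

```python
from concurrent.futures import ThreadPoolExecutor

def run_jobs(process_chunk, chunk_files, workers=30):
    # run at most `workers` jobs at a time -- roughly what the gevent
    # script (or `xargs -n 1 -P 30`) was doing
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_chunk, chunk_files))
```

Threads are fine here because each job spends its time waiting on curl and gzip subprocesses, not in the Python interpreter.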

I used "nmon" to determine the optimal amount of parallelization, which turned out to be about 30 simultaneous downloads. The bottleneck turned out to be CPU rather than network, by a small margin. Unpacking the .gz files is expensive! The 30 gzip commands were at the top in "top". At 30 simultaneous downloads I was doing 95% CPU on 32 cores and about 5Gb of network traffic. Initially I launched the instance in the wrong region, so latency to S3 was pretty high; that had a pretty big influence on performance, but I didn't write down the exact numbers. For the record, the data set is hosted in us-east-1.

At my day job I regularly work with about 30 physical machines with 288GB RAM and 16 cores each, but I think for those 4 hours in May this "little" VM did more than all that infrastructure combined.

Part 2: fetching a ****load of certificates

So now I had a whole lot of files with a whole lot of domain names each, with many duplicates. I needed to get that down to one big list of unique domain names. I'm usually on the McIlroy side of Knuth vs. McIlroy, in that I prefer simple commands that get the job done. In this case, doing

cat segments/* | sort | uniq

would be preferable. The problem was that the original data set containing all duplicates did not fit in my laptop's memory and my "sort" run was eating all of it, so I solved it with a bit of python and a marisa_trie in which duplicates are removed. Looking back, maybe "sort -u" has the same optimization? Anyway, I also learned that setting "LC_ALL=C" significantly speeds up "sort" (about 4x in my case) and can be used if you're not dealing with special characters.

So now I had 26 million domain names. How do we get the X509 certificates from those servers? I created a small program in go that does just that:

I executed it with

cat domainnames | go run fetch.go

This ran for a couple of hours on my laptop with quite low CPU and network usage. It does nothing but an SNI-enabled SSL/TLS handshake, storing each certificate under a filename that's its sha1 sum.
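The Go source isn't included in this archive; the core of the idea looks roughly like this in Python (a sketch, not the original program -- the function names are mine):

```python
import hashlib
import socket
import ssl

def cert_filename(der_bytes):
    # each certificate is stored under its sha1 sum
    return hashlib.sha1(der_bytes).hexdigest()

def fetch_leaf_cert(domain, timeout=5):
    # SNI-enabled TLS handshake; verification is disabled on purpose,
    # since we want the certificate even when it would fail validation
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    with socket.create_connection((domain, 443), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=domain) as tls:
            return tls.getpeercert(binary_form=True)  # DER bytes
```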

Now I had about 1.4 million X509 certificates.

Part 3: playing with the data

Here are the links for:
- archive of certificates (tar.gz torrent)
- archive of domain names (.gz torrent)

The following program prints hashes of the public key field, the entire certificate, the subject and the issuer.

Using the output of

tar -zxOf certificates.tar.gz | go run print-certs.go > hashes

I was able to find some nice things:

  • People reusing the same public/private key across multiple certificates. Some guy in Russia did this 174 times. Found with:

    sort hashes | uniq -w 30 -c -d | sort -n | tail

  • Top 3 CAs: 3. RapidSSL CA, 2. COMODO RSA Domain Validation Secure Server CA, 1. Go Daddy Secure Certificate Authority - G2. Found with:

    cat hashes | cut -f 4 -d ' ' | sort | uniq -c | sort -n | tail

  • 59693 self-signed certificates with subject C=US, ST=Virginia, L=Herndon, O=Parallels, OU=Parallels Panel, CN=Parallels Panel/

You could use the data set to scan for insecure public keys (< 1024 bit length), outdated certificates, which dates are most popular for creating certificates, or how long before revocation people tend to renew their certificates, to name a few ideas.

I managed to design, run and process the results of this experiment in one weekend. The total AWS costs were just under $10 for running the $1.70/h machine for 5 hours. Getting data out of AWS is normally pretty expensive, but as I was only getting data into my instance, the network costs were a couple of cents at most. The rest of the experiment could easily run on my laptop and home Internet connection. All in all I think this is a great example of using AWS for CPU/network intensive experiments, and I certainly got my money's worth of computing power!

If you have any questions or ideas for improvement, I'd like to hear them in the comments on Hacker News!

Tuesday, 23 June 2015

Why you and I will never have a 15 hour work week

tldr: time to market.

Recently Keynes' prediction of a 15 hour work week has resurfaced in discussions ranging from automation to basic income. Why exactly do we still work 40+ hours? I've seen many reasons*, but I'd like to focus on something I missed in all the articles and comments: time to market. Companies need their creative employees to deliver at peak performance to achieve the fastest time to market. How can they invent & ship their products as soon as possible? By sticking to a small, stable team that works about 40 hours per week per person.

There are two reasons for this:
  1. In creative professions the individuals are not easily interchangeable. You can have a different bricklayer continue your part of a wall the next day, getting another programmer to continue your piece of code is much more difficult. The same people have to work on a project for sustained periods of time.
  2. In the last century it has been demonstrated that you get most out of employees when they work close to 40 hours per week.

So in conclusion: I don't think that innovating companies that want to stay competitive will ever let their creative people have a 15 hour work week. Nothing stops the employees from retiring at 40 though ;)

discuss on HN

* These are the reasons I've seen:

  • Wealth inequality: the top X % are reaping the benefits, the rest don't have the wealth to work less
  • Work as leisure: people enjoy work more than their free time
  • Bullsh*t jobs: we invent unimportant work to keep busy
  • Consumerism: we want more and higher quality things

Wednesday, 21 May 2014

Browsers Should Shame ISPs for not Providing IPv6

I propose something like in this image, so browsers put some pressure on ISPs and web hosts to adopt ipv6.

For years we have heard that the Internet is about to break because the world is running out of ipv4 addresses. Sad, but not true. In my career as a programmer I've seen dirty hacks grow into extremely elaborate systems, despite the fact that they were originally set up to solve only moderately difficult problems. NATting is certainly a dirty hack, and now that ipv4 addresses are running out we're going to see lots of multi-level NAT gateways, which are certainly elaborate systems if they need to provide high availability.

The problem is that these elaborate systems are difficult to maintain, and with each additional change you sink deeper into the quicksand. So even though I believe engineers will be able to work around the problems in ever dirtier ways it would be really good for the Internet if we could throw away the old system and switch to ipv6 en masse.

Who can help us escape from the quicksand that is ipv4? The Government? A plane? Superman? No. Browsers.

The Internet is in a catch-22 situation concerning the ipv6 switch. Consumers don't feel the need to switch as long as they can still use ipv4, and producers don't want to invest in ipv6 while all consumers still support ipv4. We should keep in mind that the general public is completely unaware of this issue. When I tell my fellow programmers that I've set up an ipv6 tunnel at home, they don't say "wow that's amazing, can you help me set one up too?". No, they say "cool story bro", shrug and walk away. In a way that's logical as the end-user gains nothing by the switch, everything worked just fine in the dirty-hack system. The only winners are the Internet engineers. But if this is how programmers who know about the issue respond, then what does the average consumer know about ipv6? Nothing.

Now what does everyone, including the average consumer, use to connect to the Internet? That's right: browsers.

What if your browser displayed a green endorsement like this (I know, I'm not a designer).

And if your ISP (or the server) is stupid:

The information block should also contain a link to a site displaying why ipv6 is good for the Internet. I know that if I were a CEO, I would not like customers to see this on the website of my company.

Good idea? Bad idea? Discuss on HN.

Tuesday, 20 August 2013

I moved a 4000 line coffeescript project to typescript and I liked it

TLDR: jump straight to the TypeScript section
About 8 months ago I started a new complex web app in javascript, and it quickly grew out of hand.
It had:
  • a server with routes
  • a singleton object with state, logic and helper functions
  • a bunch of similar plugins that extend functionality
  • the singleton object lives both on the server and on the client
Very soon I decided that javascript allowed too many patterns. I wanted modules, classes and easy binding of the this keyword.
Someone recommended CoffeeScript and I went with it.
The codebase expanded to about 4000 LOC in a matter of weeks.

So CoffeeScript hm, what about it?

These are my experiences after maintaining a non-trivial coffeescript application for a couple of months.

  • Programming was quicker, stuff you want is already in the language. (classes, inheritance, array comprehension, filters)
  • Less verbose.
  • for k,v in object
  • fat arrow
  • "string interpolation #{yes.please}"
  • fat arrow is very similar to thin arrow, git diff thinks this sucks
  • syntax: the attempt at avoiding braces is horrible. Function calling is a mess.
  • It smells like ruby. I dislike ruby with a vengeance.
  • no more var keyword? This is disturbing and error prone, given its significant subtleties in javascript.
  • everything is an expression? I like to be explicit about return values kthnxbye.
The result: a buggy codebase that feels scary, lots of unsafe monkey patching, and coffeescript that seems to disagree with the idea of coffeescript.


When I started this codebase TypeScript had just launched. I deemed it a bit too experimental to work with, but last weekend I decided to give it a go. On Sunday I did git checkout -b typescript-conversion, installed the typescript syntastic plugins and started up vim. Fourteen straight hours of refactoring later it was done and 4238 lines of coffeescript had turned into 6145 lines of typescript.

I compiled all the .coffee files to .js files, removed all the .coffee files from the repo, and renamed all the .js files to .ts. Technically I was already done, as js is a strict subset of typescript, but doing everything typescript style was a bit more work.

Here are my experiences.

  • fat arrow: removed almost all uses of self = this.
  • static type and function signature checking. I immediately fixed about ten hidden bugs thanks to this.
  • classes and modules have never been easier
  • linking using tags
  • compiling linked files to one concatenated file out of the box using tsc --out
  • aim at ecmascript 6 forwards compatibility
  • slower tooling (vim syntastic takes about 3-5 seconds after each buffer save)
  • no way of doing stuff like for k, v in object
  • no string interpolation
  • no automatic ecmascript 3 compatibility layer (monkeypatching Array with indexOf etc.)


I really really really like TypeScript. My project feels really clean, I see lots of room for improvement (and this time I know where to start). For larger codebases typescript will greatly improve maintainability.

If you work on a large codebase you can either automate testing, enforce developer discipline or move to static typing and a compile step. I think the last option is greatly preferred.

Wednesday, 10 April 2013

Storing branch names in git (so not Only the Gods will know which branches you merged)

tl;dr: a better git branching workflow under the bold sentence below.

So yesterday I enjoyed some of the Git Koans by Steve Losh. While I believe that the criticism is valid, I think it also misses the point about git.

Forget what you know about svn, mercurial or other version control systems when thinking about git. The fact that most people use it to version source code is irrelevant. Today I thought of a way to use git to store, merge and distribute the nginx configurations across our front-facing webservers. We use git to version /etc, etc. So the big question here is: when will we see a perfect version control system for source code on top of the git data structure/algorithms?

I've been using git for almost two years, but only since reading the Pro Git book 6 months ago did I really understand any of it. The thing about git is:

a git repo is a data structure and the git commands are algorithms, and that's it.

But anyways, I wanted to offer a solution to the problem posed in "Only the Gods".

It is quite simple: when you create a branch, always start with creating an empty commit in that branch, like so:

git checkout -b BRANCH_NAME
git commit --allow-empty -m "created branch BRANCH_NAME in which we will ...."

That way, after merging, you will always have a reference to the branch names in which the commits were made. Seems like a good way to use the git system for versioning source code :)

Discuss this on HN.

Some more explanation if you need it:

In the git data structure branches are pointers to commits. Much like a variable in most programming languages, its name is irrelevant to the rest of the system. (I can hear you thinking: "But I work with variable names all the time, to me they are important!". Yes you're right, but like I said: forget about it and focus on what the algorithm does, at least for now). When examining a commit history using tig, we can trace the two commits involved in a merge operation:

You can see that the text of the merge commit already holds the branch names, but this is not guaranteed when you're merging remote branches or want to have a custom commit message.

Tuesday, 15 January 2013

WTF Google, you stole my $5 - Update

A couple of weeks ago I posted about this issue I had with my Google Chrome Developers account. After a minor public outrage (#1 on Hacker News for an hour or two) I thought: this can go two ways, either they fix the problem perfectly right now and send me apologies and a free phone or something, or they keep ignoring it. Either way: I should tell the world how it went.

Well, the actual outcome was somewhere in the middle. After three days of silence I got a reply on the Chromium Apps discussion board from Google's Developer Advocate Joe Marini. He apologized, said that the issue had been fixed and so it had. No big deal and certainly no free phone ;) (Bummer, my old HTC Vision just broke down)

I published my extension straight away. The status turned to "published", but it did not show up in search and neither was I able to install it. I waited, and after two days hit publish again. Now the status turned to "pending review". I waited two weeks. Then I posted this new issue on the Chromium Apps board. After half a day Joe replied once more, and what do you know, after a couple of hours the issue got fixed.

This morning my extension finally got published! Not exactly a smooth process though.

Two things struck me:
A) this Joe guy responds to almost all the questions on the Chromium forums. Apart from his replies Google is dead silent.

B) Google handled this in a rather off-hand manner. Does that mean that they're not really working on improving this process? Would it have been an indicator of improvement if they would have made a bigger deal out of it? There is so much silence from their side that it is really hard to make out what is actually happening.

Well, at any rate, you now know that my app finally got published after 2.5 months. And you've learned that complaining works every now and then, however much it goes against your personality. It certainly does not come naturally to me.

You can discuss this on HN