Parsing 10TB of Metadata, 26M Domain Names and 1.4M SSL Certs for $10 on AWS

Last May I was working on hobby project similar to this: . As I found the cert-chain-resolver project a couple of days later I did nothing with the results, but I got some nice comments on how I used 1 VM to download & process 10TB in a couple of hours on this HN thread recently so I decided to do a write up on the process and publish the data.

See the parts below:

My approach was somewhat different from the github project above, instead of using the AIA extension I wanted to brute-force the solution by finding all known intermediate and root certificates in advance. Based on the checksum of the issuer/subject fields I could look up which certificates "claimed" to be the signer of the certificate and then using the signature I could filter out which ones actually were. You can use my online service like this:

cat CERT.crt | curl -F myfile=@- -X POST

You then get the entire chain back on your prompt.

Part 1: downloading 10TB of metadata in 4 hours

Brute-forcing always leads to interesting optimization challenges. In this case I needed to scrape as many certificates from the SNI-enabled web as I could, and for that I needed to find the longest list of domain names on the Internet available. There is the Alexa top 1 million, but I thought something better should be available. I could not find anything as a direct download, but I did stumble upon the Common Crawl project. They scan a huge amount of urls and publish the results on S3. Luckily they split the metadata (headers etc) from the actual data, and the url is listed in the metadata. This could be filtered down to a list of all urls, which could be filtered down to a list of all domain names. Certainly an approach that is promising as I don't need the entire data set, just the metadata will suffice.

Still, the amount of metadata of one scan is about 10TB. Downloading that from my home with a 60Mb connection would take about 15 days to download and would probably violate the Fair Use Policy of my ISP.

However, the data is on S3 and AWS offers 32 core, 10Gb instances with free data transfer to the c3.8xlarge machine. This leads to a theoretical processing time of 2 hours if the network is the bottleneck, or about a 100x speedup from my home connection! The machine costs about $1.70 per hour. I would not need any hadoop/map-reduce or any type of clustering to get results.

Now I'm a big fan of using simple unix commands to get complex tasks done. Using the "split" command I split out the list of metadata files into separate chunks of about 100 files. Each line of these chunks were processed with a pipeline like this

curl -s$i | zgrep '^WARC-Target-URI:' | cut -d'/' -f 3 | cut -d'?' -f 1 | sort | uniq

This filters a 150MB gzipped file down to about 100KB of raw domain names. I used a bit of python with gevent for job scheduling, but looking back "xargs -n 1 -P 50" would have done just as well. Even though the chunk results were only 100KB, after processing thousands of them the 8GB root partition was filling up, so I started "rsync --remove-source-files" in a loop to transfer the result files to my laptop. I could have stored the results on the EBS volume, but then after all hard work was done I would need to wait until the results (10GB) were downloaded to my laptop. Better download them incrementally and throw away the $1.70/hour machine the minute I was done with it. Not that an extra $1.70 would break the bank, but I challenged myself to keeping the efficiency at a maximum.

I used "nmon" to determine the optimal amount of parallellization, which turned out to be about 30 simultaneous downloads. The bottleneck turned out the be CPU rather than network, by a small margin. Unpacking the .gz files is expensive! The 30 gzip commands were at the top in "top". At 30 simultaneous downloads I was doing 95% CPU on 32 cores and about 5Gb of network traffic. Initially I launched the in the wrong region so latency to S3 was pretty high, that had a pretty big influence of performance but I didn't write down the exact numbers. For the record, the data set is hosted in us-east-1.

At my day job I regularly work with about 30 physical machines with 288GB RAM and 16 cores each, but I think for those 4 hours in May this "little" VM did more than all that infrastructure combined.

Part 2: fetching a ****load of certificates

So now I had a whole lot of files with a whole lot of domain names each with many duplicates. I needed to get that down to one big list of unique domain names. I'm usually on the McIlroy side in Knuth vs McIlroy, in that I prefer simple commands that get the job done. In this case, doing

cat segments/* | sort | uniq

would be preferable. The problem was that the original data set containing all duplicates did not fit in my laptop's memory and "sort" needs to keep the whole thing in memory, so I solved it with a bit of python and a marisa_trie where duplicates are removed. Looking back, maybe "sort -u" has the same optimization? Anyway, I also learned that setting "LC_ALL=C" significantly speeds up "sort" (about 4x in my case) and can be used if you're not using special characters.

So now I had 26 million domain names. How do we get the X509 certificates from those servers? I created a small program in go that does just that:

I executed it with

cat domainnames | go run fetch.go

This ran for a couple of hours on my laptop with quite low CPU and network usage. It does nothing but a SNI enabled SSL/TSL handshake and stores all the certificates under a filename that's the sha1 sum.

Now I had about 1.4 million X509 certificates.

Part 3: playing with the data

Here are the links for:
- archive of certificates (tar.gz torrent)
- archive of domain names (.gz torrent)

The following program prints hashes of the public key field, the entire certificate, the subject and the issuer.

Using the output of

tar -zxOf certificates.tar.gz | go run print-certs.go > hashes

I was able to find some nice things:

  • People reusing the same public/private key across multiple certificates. Some guy in Russia did this 174 times.  sort hashes | uniq -w 30 -c -d | sort -n | tail
  • Top 3 CAs are 3: RapidSSL CA, 2: COMODO RSA Domain Validation Secure Server CA, 1: Go Daddy Secure Certificate Authority - G2. cat hashes | cut -f 4 -d ' ' | sort | uniq -c | sort -n | tail
  • 59693 self-signed certificates with subject C=US, ST=Virginia, L=Herndon, O=Parallels, OU=Parallels Panel, CN=Parallels Panel/
You could use the data set to scan for insecure public keys (< 1024 bit length), outdated certificates, which dates are most popular to create certificates or how long before revocation people tend to renew their certificate, to name a few ideas.
I managed to design, run and process the results of this experiment in one weekend. The total AWS costs were just under $10 for running the $1.70/h machines for 5 hours. Getting data out of AWS is pretty expensive normally, but as I was only getting data into my instance the network costs were a couple of cents at most. The rest of the experiment could easily run on my laptop and home Internet connection. All in all I think this is a great example of using AWS for CPU/network intensive experiments and I certainly got my money's worth of computing power!

If you have a any questions or ideas for improvement, I'd like to hear it in the comments on Hacker News!

Popular posts from this blog

AI programming tools should be added to the Joel Test

The unreasonable effectiveness of i3, or: ten years of a boring desktop environment

The long long tail of AI applications