Thoughts and experiments in programming by a startup CTO in Amsterdam.
Why this blog?
Whenever I come across a problem, I search Google for answers. Every now and then the answer is not straightforward and I have to do some ingenious thinking. This blog allows me to share my findings with other developers.
Last May I was working on a hobby project similar to this: https://github.com/zakjan/cert-chain-resolver/. Since I found the cert-chain-resolver project a couple of days later, I did nothing with the results. Recently, though, I got some nice comments on this HN thread about how I used 1 VM to download & process 10TB in a couple of hours, so I decided to do a write-up on the process and publish the data. See the parts below:
Part 1: downloading 10TB of metadata in 4 hours
Part 2: fetching a ****load of certificates
Part 3: playing with the data
Total costs
My approach was somewhat different from the GitHub project above: instead of using the AIA extension, I wanted to brute-force the solution by finding all known intermediate and root certificates in advance. Based on the checksum of the issuer/subject fields I could look up which certificates "claimed" to be the signer of the certificate, and then, using the signature, I could determine which ones actually were. You can us
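The two-step lookup described above (find candidates by a checksum of the issuer/subject fields, then keep only those whose signature check passes) could look roughly like the sketch below. Everything in it is an assumption made for illustration: the `Cert` record, `name_key`, `build_index` and `find_signers` are hypothetical names, the checksum is assumed to be SHA-256, and the actual signature check is left to a caller-supplied `verify` function rather than implemented here.

```python
import hashlib
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Cert:
    """Hypothetical, simplified certificate record (not a real X.509 parser)."""
    subject: bytes   # encoded subject name
    issuer: bytes    # encoded issuer name
    payload: bytes   # opaque stand-in for key and signature material


def name_key(name: bytes) -> str:
    """Checksum of a name field, used as the lookup key (SHA-256 is an assumption)."""
    return hashlib.sha256(name).hexdigest()


def build_index(known: Iterable[Cert]) -> Dict[str, List[Cert]]:
    """Index the known intermediate and root certificates by subject checksum."""
    index: Dict[str, List[Cert]] = defaultdict(list)
    for cert in known:
        index[name_key(cert.subject)].append(cert)
    return index


def find_signers(
    cert: Cert,
    index: Dict[str, List[Cert]],
    verify: Callable[[Cert, Cert], bool],
) -> List[Cert]:
    """Return the certificates that actually signed `cert`.

    Step 1: collect candidates whose subject matches the issuer of `cert`,
            i.e. the ones that "claim" to be the signer.
    Step 2: keep only the candidates for which verify(cert, candidate) is True.
    """
    candidates = index.get(name_key(cert.issuer), [])
    return [candidate for candidate in candidates if verify(cert, candidate)]
```

In a real setup `verify` would check the certificate's signature against the candidate's public key with a proper cryptography library; the index only narrows down which candidates are worth checking.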
I was recently trying to work with the Python package warcio and feed an S3 object from the Common Crawl bucket directly into it.

    r = s3.get_object(Key='crawl-data/file....', Bucket='commoncrawl')
    for record in ArchiveIterator(r['Body']):
        pass

However, this fails with the error:

    self.offset = self.fh.tell()
    AttributeError: 'StreamingBody' object has no attribute 'tell'

The reason is that the StreamingBody that boto3 returns for an S3 object doesn't support tell. It's easily fixable by creating a tiny class:

    class S3ObjectWithTell:
        def __init__(self, s3object):
            self.s3object = s3object
            self.offset = 0

        def read(self, amount=None):
            result = self.s3object.read(amount)
            self.offset += len(result)
            return result

        def close(self):
            self.s3object.close()

        def tell(self):
            return self.offset

You can now use this class and change for record in ArchiveItera
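To see the offset tracking work without S3 access, here is a small self-contained sketch. `NoTellStream` is a hypothetical stand-in for boto3's StreamingBody (it only offers `read` and `close`), and the wrapper class is repeated from the post so the example runs on its own.

```python
import io


class S3ObjectWithTell:
    """Wrapper from the post: counts the bytes read so tell() can report an offset."""

    def __init__(self, s3object):
        self.s3object = s3object
        self.offset = 0

    def read(self, amount=None):
        result = self.s3object.read(amount)
        self.offset += len(result)
        return result

    def close(self):
        self.s3object.close()

    def tell(self):
        return self.offset


class NoTellStream:
    """Hypothetical stand-in for StreamingBody: read() and close(), but no tell()."""

    def __init__(self, data: bytes):
        self._buf = io.BytesIO(data)

    def read(self, amount=None):
        return self._buf.read(amount)

    def close(self):
        self._buf.close()


body = S3ObjectWithTell(NoTellStream(b"0123456789"))
body.read(4)
print(body.tell())  # offset after reading 4 bytes
body.read()
print(body.tell())  # offset after reading the remainder
body.close()
```

With the real objects from the post, the wrapped body would presumably be passed along as `ArchiveIterator(S3ObjectWithTell(r['Body']))`; that call is not exercised here.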
Universally Unique Identifiers (UUIDs) are great. I love how you can tell the progress of a batch job just by looking at the current UUID. If it starts with 0..., the task is less than 1/16th done. If it starts with 7d.., we're almost halfway there. At ff... we are nearing the end. The fact that you can tell this rests on two principles: 1) you sort your jobs by their UUID and 2) UUIDs are random, as in, distributed uniformly. However, last week, I noticed a strange thing: a clearly visible pattern in the uuid column of a database table. It should be impossible, but there it was. It looked like this:

    > SELECT uuid FROM example ORDER BY id;
    4f95de28-0fd1-48db-ad2e-34ecd169c483
    4331cb9e-1d91-11e9-be2c-45923c63e8a2
    4331cc4c-1d91-11e9-be2c-45923c63e8a2
    4331ccec-1d91-11e9-be2c-45923c63e8a2
    4331cd7e-1d91-11e9-be2c-45923c63e8a2
    c7e2f124-f6ba-4434-843f-89958a7436ec
    4331ce10-1d91-11e9-be2c-45923c63e8a2
    4331ce9e-1d91-11e9-be2c-45923c63e8a2
    4331cf28-1d91-11e9-be2c-45923c
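The progress trick from the opening of this post can be written down as a small calculation. This is only a sketch under the two assumptions stated there: jobs are processed in sorted UUID order and the UUIDs are uniformly distributed. The function name `progress_from_uuid` and the sample UUIDs are made up for illustration.

```python
import uuid


def progress_from_uuid(current: str, hex_digits: int = 4) -> float:
    """Estimate batch progress (0.0 to 1.0) from the leading hex digits of a UUID.

    Only meaningful if jobs are sorted by UUID and the UUIDs are uniformly
    distributed, e.g. random (version 4) UUIDs.
    """
    prefix = uuid.UUID(current).hex[:hex_digits]
    return int(prefix, 16) / 16 ** hex_digits


# Made-up UUIDs matching the prefixes mentioned in the text: 0..., 7d.., ff...
for sample in (
    "0fffffff-ffff-4fff-bfff-ffffffffffff",
    "7d000000-0000-4000-8000-000000000000",
    "ff000000-0000-4000-8000-000000000000",
):
    print(sample, f"{progress_from_uuid(sample):.1%}")
```

Using more hex digits gives a finer estimate, but the estimate is only as good as the uniformity assumption behind it.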