The Freaky bl.aagh

html_entities_ascii() just appeared under Projects; this is a very fast HTML escaping function for binary data and ASCII text. We will be using it to generate NZB fragments from our new database.
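To give a feel for what a function like this does, here is a minimal sketch. It is not the actual html_entities_ascii() source: the name escape_html_ascii, its signature, and the choice to turn bytes of 0x80 and above into numeric character references are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical sketch, NOT the real html_entities_ascii().
 * Escapes src[0..len) into dst as a NUL-terminated string.
 * Returns the number of bytes written (excluding the NUL), or
 * (size_t)-1 if dst (capacity cap) is too small.
 * A buffer of 6 * len + 1 bytes is always large enough.
 */
size_t escape_html_ascii(const unsigned char *src, size_t len,
                         char *dst, size_t cap)
{
    size_t out = 0;

    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < len; i++) {
        char num[8];             /* holds "&#255;" plus NUL */
        const char *rep = NULL;  /* replacement text, if any */
        size_t n = 1;            /* bytes this input byte expands to */

        switch (src[i]) {
        case '&':  rep = "&amp;";  break;
        case '<':  rep = "&lt;";   break;
        case '>':  rep = "&gt;";   break;
        case '"':  rep = "&quot;"; break;
        case '\'': rep = "&#39;";  break;
        default:
            /* Assumption: non-ASCII bytes become numeric references. */
            if (src[i] >= 0x80) {
                snprintf(num, sizeof num, "&#%u;", (unsigned)src[i]);
                rep = num;
            }
            break;
        }
        if (rep != NULL)
            n = strlen(rep);
        if (out + n >= cap)      /* keep one byte for the final NUL */
            return (size_t)-1;
        if (rep != NULL)
            memcpy(dst + out, rep, n);
        else
            dst[out] = (char)src[i];
        out += n;
    }
    if (out >= cap)              /* only reachable when len == 0 and cap == 0 */
        return (size_t)-1;
    dst[out] = '\0';
    return out;
}
```

Taking an explicit length rather than a NUL-terminated string is what lets a function like this cope with binary data that contains zero bytes.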

That's right, Newzbin is writing its own database for Message-IDs; this will remove about 5 billion rows from our MySQL database and put them in a form which can be accessed and stored more efficiently. We're benchmarking it at about 110,000 inserts/sec, which includes checksumming every page (for extra paranoia, considering we're using ZFS).
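Per-page checksumming can be pictured roughly like this. The page size, the struct layout, the function names, and the use of Adler-32 are all assumptions for the sake of the sketch; nothing here describes the real on-disk format or the checksum actually used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u  /* assumed page size, not taken from the real database */

/* Hypothetical page layout: a checksum followed by the data it covers. */
struct page {
    uint32_t checksum;
    unsigned char payload[PAGE_SIZE - sizeof(uint32_t)];
};

/* Plain Adler-32, used here only as a stand-in checksum. */
uint32_t adler32_sum(const unsigned char *buf, size_t len)
{
    uint32_t a = 1, b = 0;

    assert(buf != NULL);
    for (size_t i = 0; i < len; i++) {
        a = (a + buf[i]) % 65521u;
        b = (b + a) % 65521u;
    }
    return (b << 16) | a;
}

/* Compute and store the checksum just before the page is written out. */
void page_seal(struct page *p)
{
    assert(p != NULL);
    p->checksum = adler32_sum(p->payload, sizeof p->payload);
}

/* Recompute after reading the page back; returns 1 if it matches, 0 if not. */
int page_verify(const struct page *p)
{
    assert(p != NULL);
    return p->checksum == adler32_sum(p->payload, sizeof p->payload);
}
```

The idea is simply to call page_seal() before every write and page_verify() after every read, so that a corrupted page is noticed rather than silently served.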

Once this is done, MySQL can concentrate on our file and other smaller tables, and we can look to extending our retention to match the recent spate of NSPs announcing 365 days and beyond.

It's funny how C is repeatedly turning out to be useful for a website mostly driven by PHP and Ruby; Newzbin depends on quite a lot of our custom C services and libraries. Let's enumerate some of them:

  • pencil; inspired by the pen load balancer, pencil is a buffering service. We use it to talk to PHP daemons over FastCGI; pencil fully reads the PHP response and buffers it for slow clients, so PHP can get on with other requests instead of hanging waiting for a client to read() a response.
  • searchd; our first generation search accelerator service. It splits titles and subjects into 2- and 3-tuples and indexes which files and reports contain which pairs and triples of letters. It's used entirely for v2.
  • resultd; our second generation search accelerator; or rather, a full-blown search engine, designed specifically for our datasets and our queries. This is what drives listings on v3. Watchdog is also driven by a related service, part of the same codebase.
  • msgidd; a bloom filter service; every Message-ID we add goes via this service. Its job is to remember what it's seen, so we don't insert duplicate segments into the database. Prior to an upgrade of a backend server, it had been running non-stop for about 600 days.
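For anyone who hasn't met a bloom filter before, the idea behind a service like msgidd can be sketched in a few lines of C. This is not msgidd's code: the struct, the function names, the FNV-1a hash, and the double-hashing scheme are all assumptions chosen to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical bloom filter; none of these names come from msgidd. */
struct bloom {
    uint8_t *bits;     /* bit array, nbits bits long */
    size_t   nbits;
    unsigned nhashes;  /* number of bits set per key */
};

/* 64-bit FNV-1a with a seed mixed into the offset basis (an assumption). */
static uint64_t fnv1a64(const char *s, uint64_t seed)
{
    uint64_t h = 14695981039346656037ULL ^ seed;

    for (; *s != '\0'; s++) {
        h ^= (unsigned char)*s;
        h *= 1099511628211ULL;
    }
    return h;
}

/* Returns 0 on success, -1 if the bit array could not be allocated. */
int bloom_init(struct bloom *bf, size_t nbits, unsigned nhashes)
{
    assert(bf != NULL && nbits > 0 && nhashes > 0);
    bf->bits = calloc((nbits + 7) / 8, 1);
    bf->nbits = nbits;
    bf->nhashes = nhashes;
    return bf->bits != NULL ? 0 : -1;
}

void bloom_free(struct bloom *bf)
{
    free(bf->bits);
    bf->bits = NULL;
}

/*
 * Returns 1 if key was probably seen before, 0 if it is definitely new.
 * Either way, the key is recorded in the filter afterwards.
 */
int bloom_test_and_add(struct bloom *bf, const char *key)
{
    /* Derive nhashes bit positions from two hashes (double hashing). */
    uint64_t h1 = fnv1a64(key, 0);
    uint64_t h2 = fnv1a64(key, 0x9e3779b97f4a7c15ULL) | 1;
    int seen = 1;

    for (unsigned i = 0; i < bf->nhashes; i++) {
        size_t bit = (size_t)((h1 + i * h2) % bf->nbits);
        uint8_t mask = (uint8_t)(1u << (bit & 7));

        if ((bf->bits[bit >> 3] & mask) == 0) {
            seen = 0;
            bf->bits[bit >> 3] |= mask;
        }
    }
    return seen;
}
```

A filter like this never forgets a key it has been given, but it can occasionally claim to have seen one it hasn't; how the real service deals with those false positives isn't covered here.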

Our new database is creatively named msgiddbd; Message-ID DataBase Daemon.
