Avi Freedman's Tech and Biz Topics

on internet plumbing, gadgets, nerd herding, and other related topics

New Diablo Header System

(sent to diablo mailing list)

Like many, I was frustrated with diablo's overview system as retention volumes started to grow a few years back.

The real issue started back in the 32-bit days, when mmap()'s need for contiguous address space per process on Linux was biting us with boneless and other groups.

Our first fix was to just go from mmap() to file semantics: open(), close(), pread64(), etc. We actually saw that run either faster than or at least no slower than the mmap()-based access, and have been running that for 4 years.
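
For flavor, here's roughly what that change looks like. This is a minimal sketch, not diablo's code: the function names are mine, and OverArt stands in for diablo's per-article index record.

  #define _LARGEFILE64_SOURCE
  #include <sys/mman.h>
  #include <fcntl.h>
  #include <unistd.h>

  /* Before: map the whole index and treat it as an array. */
  static int
  read_entry_mmap(const char *path, size_t nentries, size_t i, OverArt *out)
  {
      int fd = open(path, O_RDONLY);
      if (fd < 0)
          return -1;
      size_t len = nentries * sizeof(OverArt);
      OverArt *map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
      if (map == MAP_FAILED) {   /* on 32-bit, big maps die right here */
          close(fd);
          return -1;
      }
      *out = map[i];
      munmap(map, len);
      close(fd);
      return 0;
  }

  /* After: no mapping at all; pread64() exactly the entry we want. */
  static int
  read_entry_pread(const char *path, size_t i, OverArt *out)
  {
      int fd = open(path, O_RDONLY);
      if (fd < 0)
          return -1;
      ssize_t n = pread64(fd, out, sizeof(OverArt),
                          (off64_t)(i * sizeof(OverArt)));
      close(fd);
      return (n == (ssize_t)sizeof(OverArt)) ? 0 : -1;
  }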

About 3 years ago, I finally got really pissed off at the monolithic overview indices, largely because dexpireover would sometimes eat data files just because the overview index had gotten corrupted or dexpireover itself got confused. Secondarily, of course, it would be nice to have a solution that allows retention of articles from the moment a group is created (the standard dexpireover throws away any data files that dreaderd didn't store index data for).

What I wound up going with was to put the overview index data for each article INTO the data files.

All that’s left in the over.* file is the OverHead struct, basically:

-rw-r--r-- 1 news news 276 2007-10-09 10:34 over.0.80011c3e.6087a5f3
-rw-r--r-- 1 news news 276 2007-10-09 10:34 over.0.8100543e.40107707
-rw-r--r-- 1 news news 276 2007-10-09 10:34 over.0.829a403e.5597596f
-rw-r--r-- 1 news news 276 2007-10-09 10:34 over.0.836c1d3e.0d9a576d
-rw-r--r-- 1 news news 276 2007-10-09 10:34 over.0.8396433e.6478c8a5
-rw-r--r-- 1 news news 276 2007-10-09 10:34 over.0.83c8ed3e.603853f4
-rw-r--r-- 1 news news 276 2007-10-09 10:34 over.0.8463b93e.3c18a685
-rw-r--r-- 1 news news 276 2007-10-09 10:34 over.0.8472e03e.34c925c1

(We ran a conversion program to convert everything to the new format, so the oldest stuff all has the same date…)

Then, in each data file, there's a fixed-size padding header up front:

#define VERSION4_PADDING_SIZE 128
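
Piecing the two snippets below together, the resulting data file layout is roughly this (my reconstruction, not a diagram from the source):

  /*
   * data.* layout, new format:
   *
   *   offset 0:    128 bytes of padding (VERSION4_PADDING_SIZE)
   *   offset 128:  OverArt[0] .. OverArt[dataEntries-1]
   *   beyond that: the overview data records themselves
   *
   * so the index entry for artno lives at:
   *   VERSION4_PADDING_SIZE + (artno - od_ArtBase) * sizeof(OverArt)
   */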

So to make a new file:

  /* size the new file up front: the padding header plus room for
     the full OverArt index covering this file's article range */
  ftruncate( od->od_HFd,
             VERSION4_PADDING_SIZE +
               ( ov->ov_Head.v.dataEntries *
                 sizeof(OverArt) )
           );

And to get an OverArt:

  /* read the single OverArt for artno straight out of the data file,
     at its fixed slot past the padding */
  pread64( ov->ov_HCache->od_HFd,
           oa, sizeof(OverArt),
           VERSION4_PADDING_SIZE +
               ( (artno - ov->ov_HCache->od_ArtBase) *
                 sizeof(OverArt) )
         );
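
The write side isn't shown above, but it's just the mirror image: pwrite64() the entry at the same computed offset. A minimal sketch, with placeholder names of my own (store_over_art, artBase), not diablo's actual function:

  /* Sketch only (same _LARGEFILE64_SOURCE/<unistd.h> setup as above):
     put an OverArt into its slot in a data file. */
  static int
  store_over_art(int fd, long artno, long artBase, const OverArt *oa)
  {
      off64_t off = VERSION4_PADDING_SIZE +
                    (off64_t)(artno - artBase) * sizeof(OverArt);
      ssize_t n = pwrite64(fd, oa, sizeof(OverArt), off);
      return (n == (ssize_t)sizeof(OverArt)) ? 0 : -1;
  }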

Now, the major downside of this approach is that canceling article headers becomes much harder (you have to touch O(# of data files) to cancel by message-ID), which we deal with by not canceling headers…

Canceling article bodies has been fine for dealing with DMCA requests. We've actually commented out the code to do header cancellation, though it could be supported. Doing scans of 4-billion-article groups, with data files that are sometimes stored compressed (that's another article), would take a loooong time.

Advantages of the approach include:

- You can NFS-mount the headers. We’ve gone diskless on our readers this way. There are other approaches to scaling with dreaderd. My favorite is overview proxying, but that’d be very complex to implement in dreaderd.

We actually have a multi-threaded reader/finder with overview proxying, but just thinking about implementing overview proxying in dreaderd caused blood to come out of my ears, and NFS mounting has been working well for us for a few years. Now, that means having a dreaderd server NFS re-export its volumes to other readers. HA is left as an exercise for the reader. If you're at really large scale you can even go SSD, though 18TB of SSD is probably pushing things. Or you can use a compressing filesystem on SSD (what we're experimenting with in a new deployment). Nehalems are fast. For optimal results, you may even want to do a unionfs and then migrate 'mature' data files to the SSD filesystem(s) once they're full of entries.

- You can store headers (forever) from the moment a group is created.

- No dexpireover process. No rebuilding overview indices.

There is a program, nbupdate, that can scan the disks and update the NB counter if you do run expiration, and there's a janky perl script lying around somewhere that cancels by deleting data files. But soon after we converted, the infinite-retention race began, so this hasn't been an issue for the last few years.

- You can compress the data.* files (if you can do on-the-fly decompression), and you get compression on the index data as well, since it now lives in the same files; a sketch of the on-the-fly part follows below.
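
To illustrate (plain zlib, not diablo code): random reads into a gzip'd data file can be emulated with gzseek()/gzread(), though gzseek() on a read stream decompresses from the front each time, which is exactly why a transparently compressing filesystem, as mentioned above, is the nicer route:

  #include <zlib.h>

  /* Sketch, assuming a data.* file stored as a single gzip stream.
     gzseek() here is emulated by decompressing from the start of
     the file, so this is correct but slow for deep offsets. */
  static int
  read_over_art_gz(const char *path, long artno, long artBase, OverArt *oa)
  {
      gzFile zf = gzopen(path, "rb");
      if (zf == NULL)
          return -1;
      z_off_t off = VERSION4_PADDING_SIZE +
                    (z_off_t)(artno - artBase) * sizeof(OverArt);
      if (gzseek(zf, off, SEEK_SET) < 0) {
          gzclose(zf);
          return -1;
      }
      int n = gzread(zf, oa, sizeof(OverArt));
      gzclose(zf);
      return (n == (int)sizeof(OverArt)) ? 0 : -1;
  }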

The code is mostly in group.c, with a few lines in nntp.c. The overview conversion program and nbupdate are C. The scripts are all perl.

But…

- This was implemented as a running spec that sort of just worked, so it has never been prettified

- Also, it has only been tested (and the scripts are hard-coded) for the older hash system we use (over.* and data.*)

- The overview conversion program can create overview index entries for data files that never had them. When we did the initial conversion, we had actually stopped deleting data files as part of dexpireover a year before, and after the conversion we could reference those articles.

So… the conversion program does use the oa_ArtSize and oa_TimeRcvd from the old overview index if it's there, but if not it fudges. We didn't get any complaints about the fudged oa_ArtSize, but it was a concern. And, as mentioned, it's not beautiful code, especially the overview conversion and nbupdate code.
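
Back to the fudging: in sketch form the fallback looks something like this (old_oa, record_len, and file_mtime are placeholders of mine; the actual fudge values are whatever the conversion program picks):

  if (old_oa != NULL) {
      new_oa.oa_ArtSize  = old_oa->oa_ArtSize;   /* trust the old index */
      new_oa.oa_TimeRcvd = old_oa->oa_TimeRcvd;
  } else {
      /* assumption: plausible stand-ins when no old entry exists */
      new_oa.oa_ArtSize  = record_len;   /* e.g. the stored record length */
      new_oa.oa_TimeRcvd = file_mtime;   /* e.g. the data file's mtime */
  }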

So…

If there is interest, I'd be willing to hand the code over to someone to prettify, or potentially do it myself. If no one cares any more (which I suspect may be the case, given list traffic and the insane current scale of things limiting the number of people who care about doing up years' worth of overviews), that's fine too, of course.

Avi Freedman
readnews.com

(Postscript, Aug 29, 2011: no one cared. The 3 other remaining large Usenet sites all did their own modifications, and most of the other folks who wrote me about this wound up outsourcing anyway.)