User talk:DCDuring/1000EnglishEntries

The code
—Ruakh TALK 19:07, 4 July 2012 (UTC)


 * Wouldn't this tend to get more pages on the list from toward the end of the the XML file? The 10000th page had a one in ten (1000/10000) chance of getting into  (though it might be later ousted); the 1st page, however, would almost certainly be ousted by the time you got to the end of the XML file.


 * ...unless the XML file itself is in pseudorandom order (is it? I have no idea), in which case this works, though then you can just use  without the  . &#x200b;—msh210℠ (talk) 01:35, 5 July 2012 (UTC)


 * The XML dump is not in pseudorandom order, no; it seems to be in order by page-ID, which is, to a close approximation, order by when the page was initially created. But I think this algorithm is sound; if there are 1000+n English entries, then each one should have a 1000/(1000+n) chance of being in the sample. Here's a sketch of a proof by induction:
 * If there are 1000+0 = 1000 English entries, then each occurs once in the sample; that is, each has probability 100% = 1000/(1000+0) of being in the sample.
 * Assume that the assertion is true for a specific value of n. Then, what happens if there are 1000+n+1 English entries? Well, first we use the algorithm to create a 1000-member sample Sn out of the first 1000+n of them, such that (by assumption) each of those has probability 1000/(1000+n) of being in Sn. Then, according to the algorithm, we create our sample Sn+1 by giving the last English entry a 1000/(1000+n+1) chance of displacing a randomly-selected member of Sn, or put another way, we give each member of Sn a 1/(1000+n+1) chance of being displaced, that is, a (1000+n)/(1000+n+1) chance of not being displaced. To recap: each of the first 1000+n entries has a 1000/(1000+n) chance of making it into Sn, and then, if it makes it there, a (1000+n)/(1000+n+1) chance of making it into Sn+1. (And of course it has no chance of making it into Sn+1 unless it makes it into Sn.) So its total chance of being included in Sn+1 is 1000/(1000+n) &times; (1000+n)/(1000+n+1) = 1000/(1000+n+1), Q.E.D.
 * (Full disclosure: This algorithm is not original with me. I was already familiar with a well-known algorithm for selecting a random line from a file, and all I did was extend it to selecting a fixed-size random subset. I'm certain I'm not the first person to do that; in fact, Google suggests that even this "extended" version might be in Donald Knuth's The Art of Computer Programming, in which case it too is very well known.)
 * —Ruakh TALK 02:14, 5 July 2012 (UTC)


 * The proof looks correct (and elegant) . Sorry for wasting your time. &#x200b;—msh210℠ (talk) 04:31, 5 July 2012 (UTC)


 * No worries. :-)  —Ruakh TALK 11:59, 5 July 2012 (UTC)