Wiktionary talk:Frequency lists/PG/2005/08/1-10000

One more: in the first subsection titled 1-100 there are only 99 words. But in the other subsections we have 100 words.

Here are a few problems I can see:


 * 1) Apostrophe is included as a word character in English in positions where it is only rarely permitted, such as the first character ('tis in 2601-2700 is an exception):
    * 'I (701 - 800)
    * 'The (2301 - 2400)
    * 'You (2801 - 2900)
    * 't (3201 - 3300)
    * 'And (3901 - 4000)
    * 'What (3901 - 4000)
    * 'It (4101 - 4200)
    * 'Oh (4201 - 4300)
 * 2) "." seems to be blindly included as a word character even when it's not surrounded by a letter on both sides or is accompanied by an apostrophe:
    * it.' (6101 - 6200)
    * you.' (7501 - 7600)
    * me.' (7801 - 7900)
 * 3) URLs are present. Is it possible that Gutenberg's newsletters are being included as well as the out-of-copyright books?
    * pobox.com (5401 - 5500)
    * gutenberg.net (6201 - 6300)
    * pglaf.org. (9801 - 9900)

Most of these are easily filtered out. In the first case I think we'd be better off with fewer false positives at the cost of a small number of false negatives.
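For what it's worth, here is a rough sketch of the kind of filtering meant above. It is not the script that produced these lists (that isn't shown here); the names `URL_RE`, `WORD_RE` and `tokenize` are made up for illustration, and the list of domain endings is an assumption based only on the three examples above. The idea: drop URL-like tokens first, then accept an apostrophe or "." only when it has a letter on both sides.

```python
import re

# Hypothetical filter: URL-like tokens are blanked out before word extraction.
# The TLD list (com/net/org) is an assumption covering only the cases reported above.
URL_RE = re.compile(
    r"\b(?:https?://|www\.)\S+|\b[\w-]+(?:\.[\w-]+)*\.(?:com|net|org)\b",
    re.IGNORECASE,
)

# A word starts and ends with a letter; "'" and "." count as word
# characters only when a letter sits on both sides of them.
WORD_RE = re.compile(r"[a-z]+(?:['.][a-z]+)*")


def tokenize(text):
    """Return lowercased words, ignoring URLs and stray punctuation."""
    text = URL_RE.sub(" ", text)
    return WORD_RE.findall(text.lower())


if __name__ == "__main__":
    sample = "'I said it.' 'tis true; don't mail pobox.com or pglaf.org. And you.'"
    print(tokenize(sample))
```

Note the trade-off: 'tis comes out as "tis", which is the small number of false negatives accepted in exchange for losing 'I, 'The, it.' and friends.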


 * Thanks. I'd just like to say, gah!  I had punctuation characters converted to spaces in an earlier iteration; I'm not sure what I goofed up on this last round.  Oh wait, these frequency lists are not the latest version, and don't correspond to the template:rank entries.  Double gah!  --Connel MacKenzie T C 22:09, 8 December 2005 (UTC)

I was searching for some common 3 letter words, and here's what I found:

For j: joy, job, jag, joy, jos, jeg, jar and jaw...but no jug! For k: key, kun, kan, kam and kin...but no kid! For z: zoo, zou and zal...but no zip!

Hard to see how this can really be useful as any sort of approximation of the 10000 most commonly used English words, which is what I was looking for. 122.107.225.220

Why's 'Gutenberg' #243? How often does that come up?
 * Presumably because the scanner includes the copyright text at the start of every book, and that text contains 'Gutenberg'. Conrad.Irwin 03:14, 19 March 2009 (UTC)
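If that is the cause, one possible fix is to cut the boilerplate before counting. A minimal sketch follows, assuming each file marks its body with lines beginning `*** START OF` and `*** END OF`. That marker wording is an assumption, it varies between files, and older files may use different header text, so this would need checking against the actual corpus. `strip_boilerplate` is a hypothetical helper, not part of any existing script.

```python
import re

# Assumed marker lines; real files may word these differently.
START_RE = re.compile(r"^\*\*\*\s*START OF.*$", re.MULTILINE)
END_RE = re.compile(r"^\*\*\*\s*END OF.*$", re.MULTILINE)


def strip_boilerplate(text):
    """Return the text between the start and end markers.

    If a marker is missing, that side of the text is left untouched,
    so a file without markers comes back unchanged (apart from
    surrounding whitespace).
    """
    start = START_RE.search(text)
    body_start = start.end() if start else 0
    end = END_RE.search(text, body_start)
    body_end = end.start() if end else len(text)
    return text[body_start:body_end].strip()


if __name__ == "__main__":
    sample = (
        "Project Gutenberg license text\n"
        "*** START OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
        "Call me Ishmael.\n"
        "*** END OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
        "More Gutenberg small print\n"
    )
    print(strip_boilerplate(sample))
```

Running the word count on the stripped text should stop the header and footer from contributing 'Gutenberg' once or more per book.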

Why is 'la' in the first 100 words? It refers to http://en.wiktionary.org/wiki/la, where it has number 481 (and even that is a very low number for the meaning of 'la' as a syllable used in solfège in music). Lcla

Crappy
This page is really crappy, to be blunt. Can we delete it? --Newfriendforyou 00:27, 23 July 2011 (UTC)