Wiktionary talk:Frequency lists

Number 1
WTF? What word is NUMBER 1?

As I've just found some words at 90000+ with capitals, does that mean these lists aren't even counting Age and age as the same word?


 * My earlier iteration was counting them separately, yes. I am now re-running the whole process case-insensitively and will post those results soon.  -- Connel MacKenzie 17:53, 22 October 2005 (UTC)


 * The latest iteration returns only the most common capitalization variant, ranked by the combined count of all capitalization variants. The counts for age and Age are added together to determine the rank at which age is then listed.  I'm taking notes on refining it so that age., age', 'Age etc. are also combined into that count.  I think I'll have to use a manual exception list for entries like 2.  --Connel MacKenzie T C 22:06, 8 December 2005 (UTC)
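The merging described above can be sketched roughly as follows. This is only an illustration, not the script actually used for the lists: the function name `rank_case_insensitive` and the choice to strip surrounding ASCII punctuation with `string.punctuation` are assumptions, and no manual exception list is implemented.

```python
import string
from collections import Counter, defaultdict


def rank_case_insensitive(tokens):
    """Rank words by the combined count of all capitalization variants.

    Each group is listed under its most frequent spelling (hypothetical
    helper; the real list-building script is not shown on this page).
    """
    # Strip leading/trailing punctuation so "age." and "'Age" count as "age"/"Age".
    cleaned = [t.strip(string.punctuation) for t in tokens]
    cleaned = [t for t in cleaned if t]  # drop tokens that were only punctuation

    # Group the per-spelling counts under a lowercase key.
    groups = defaultdict(Counter)
    for variant, count in Counter(cleaned).items():
        groups[variant.lower()][variant] += count

    ranked = []
    for variants in groups.values():
        total = sum(variants.values())
        display = variants.most_common(1)[0][0]  # most frequent spelling
        ranked.append((display, total))

    # Highest combined count first; ties broken alphabetically.
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked
```

Called on `["age", "Age", "age", "age.", "'Age"]`, this lists a single entry, age, with a combined count of 5.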

A couple of questions. 1) Is it considered valuable to redirect things like i've to I've? Most of the red links in the first 10,000 now are just capitalization issues like this. If you type i've into the search box it sends you to the capitalized version, but the linked text shows red. So are we going with the "redirects are cheap and do no harm" idea, or do we just not care whether the Gutenberg list looks complete? 2) Would it be hard to simply filter out the Gutenberg introduction by, say, automatically skipping the first x lines? It seems that would lead to a much more valuable list. Thanks - Taxman 17:35, 4 January 2006 (UTC)

nbsp?
Why has someone included nbsp?


 * The program counted it because many files had &nbsp;s which snuck through early versions of the automated HTML-stripping. (Mostly in the soap opera transcripts, many of which have never been touched by human hands.)  If you squint hard at the end of the list, you should even be able to find a few entries like ecirc and uuml. Keffy 23:17, 22 March 2006 (UTC)
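One way to keep entity names such as nbsp, ecirc and uuml out of the counts is to decode HTML entities before splitting the text into words. The sketch below uses Python's standard `html.unescape`; the function name `words_from_html_text` and the tokenizing regex are assumptions for illustration, not the program that produced the lists.

```python
import html
import re

# Runs of Unicode letters, optionally joined by apostrophes (e.g. "I've").
WORD_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")


def words_from_html_text(text):
    """Decode HTML entities, then extract words (hypothetical helper)."""
    # "&nbsp;" becomes a no-break space and "&uuml;" becomes "ü", so the
    # entity names themselves never reach the tokenizer.
    decoded = html.unescape(text)
    return WORD_RE.findall(decoded)
```

With this order of operations, `"I&nbsp;love caf&eacute;s"` yields the words I, love and cafés rather than a stray nbsp or eacute.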

I may be going mad...
I'm very tired but I've just removed the following sentence:

Fewer 17th century religious tracts and histories of Prussia. More vampires, starships, amnesia, homicide investigations, and cocaine.

If this has anything to do with the entry then perhaps it could go back in...

Project Gutenberg
It might be worth mentioning that Project Gutenberg consists mostly of books published before 1923 and so it isn't exactly representative of words used today. See http://www.gutenberg.org/faq/C-10


 * True. That would be a good addition - please be bold.  --Connel MacKenzie 17:07, 24 August 2006 (UTC)

The Gutenberg section says:
 * the boilerplate warning for Project Gutenberg appears on each of them

Would this be http://www.gutenberg.org/wiki/Gutenberg:The_Project_Gutenberg_License ? Any idea how this has affected the results? Denishowe 10:58, 15 August 2007 (UTC)
 * Seeing that "gutenberg" is in the top 300 I'd guess it has significantly skewed the data.


 * How hard would it be to exclude the boilerplate, or to do a one-time analysis of the boilerplate and of other recurring material (including any e-book cruft) and make appropriate adjustments? Would it be possible to exclude all proper nouns? DCDuring TALK 12:16, 27 January 2008 (UTC)
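As a rough sketch of what excluding the boilerplate might look like, the snippet below keeps only the text between the start and end marker lines. It assumes each file carries marker lines resembling "*** START OF THIS PROJECT GUTENBERG EBOOK ..." and "*** END OF THIS PROJECT GUTENBERG EBOOK ..."; older files word their headers differently, so both regular expressions and the function name `strip_boilerplate` are assumptions that would need checking against the actual corpus.

```python
import re

# Assumed marker wording; not every Project Gutenberg file follows it.
START_RE = re.compile(r"^\*\*\*\s*START OF (?:THIS|THE) PROJECT GUTENBERG EBOOK", re.IGNORECASE)
END_RE = re.compile(r"^\*\*\*\s*END OF (?:THIS|THE) PROJECT GUTENBERG EBOOK", re.IGNORECASE)


def strip_boilerplate(text):
    """Return the text between the start and end markers.

    If a marker is missing, that side of the text is left untouched, so a
    file without markers comes back unchanged.
    """
    lines = text.splitlines()
    start, end = 0, len(lines)
    for i, line in enumerate(lines):
        if START_RE.match(line):
            start = i + 1  # body begins on the line after the start marker
        elif END_RE.match(line):
            end = i  # body stops just before the end marker
            break
    return "\n".join(lines[start:end])
```

Files that fall through unchanged could then be logged and handled separately, for example by skipping a fixed number of leading lines.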

Wordcount.org
Surely http://www.wordcount.org counts for something? If I recall correctly, it studies word frequencies in online communications, e.g. email, forums, etc.


 * Too bad about that copyright, though. --Connel MacKenzie 21:18, 4 December 2006 (UTC)
 * No, it is hitting the British National Corpus only, apparently. --Connel MacKenzie 21:18, 4 December 2006 (UTC)

Filters
It would be nice to see lists that exclude pronouns and prepositions.

Observation comparing the 2 lists
In most cases the present form of a verb comes before the past form in the TV list, but the past form comes first in the PG list.

How does one alphabetize?
It struck me as odd that 'a' was last in order of all the words with initial a. Then I noticed that 'down' precedes 'do', and so on. Please confirm that a# precedes ab, and so forth. Kjaer 08:18, 25 August 2008 (UTC)

Duplicate in "Most common words (TV and movie scripts)"
The word "brian" occurs in place 1670 (freq. 1146) and in place 8254 (freq. 106). There are no other duplicates among the 41284 words.

I might also note that some proper names like "Theresa" and "David" are capitalized, and others like "brian" (both occurrences) and "alice" are not. However, after converting the list to all lower case, there are still no further duplicates other than the "brian" mentioned. 207.172.220.155 17:49, 13 June 2009 (UTC)
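A check like the one described above can be reproduced in a few lines. This is a minimal sketch that assumes the list has already been extracted into a plain Python list of words; the function name `find_duplicates` is hypothetical.

```python
from collections import Counter


def find_duplicates(words, ignore_case=True):
    """Return the sorted entries that occur more than once in a word list."""
    keys = [w.lower() if ignore_case else w for w in words]
    counts = Counter(keys)
    return sorted(key for key, n in counts.items() if n > 1)
```

Running it once with `ignore_case=False` and once with `ignore_case=True` shows whether lowercasing the list introduces any further duplicates.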

Is there a German lemmatized word frequency list?
I'm interested in seeing a German word frequency list where words are counted by lemma. Even better would be other German word lists grouped by word families, as has been done for the British National Corpus.

Search
Is there a search engine to find a specific word? Surely I don't have to check every list individually to find out how often a word is used. 207.6.241.10 21:23, 8 May 2010 (UTC)
 * Yeah, just go to the Wiktionary entry for that word. If it's high enough on the list, it will say in the entry. 174.47.84.201 22:30, 6 July 2010 (UTC)

I have a further request: Is there a way to scan film scripts for specific phrases? I have looked around the subtitle sites, but these frequency lists are the closest I've got so far.

Splitting list
Does anyone have a problem if I completely split this page by language? --Bequw → τ 20:58, 21 July 2010 (UTC)


 * You mean into separate pages? I think that would defeat the point of this page, which is to have them in one place and be able to see which languages are without such lists. LokiClock 17:48, 5 February 2011 (UTC)

Old Norse MeNoTA
I've assembled a list ranking Old Norse wordforms by frequency based on existing lists for a small number of texts. The list distinguishes alternate spellings as separate wordforms and, due to the small sample size, gives disproportionate ranking to certain names of characters from the texts. This list needs additional processing (see Top missing words), and a large portion of these words are not found on Wiktionary. LokiClock 12:14, 6 January 2011 (UTC)


 * I'm posting it here because I don't know if including variant forms disqualifies it as a word list. LokiClock 17:38, 7 April 2011 (UTC)

Double appearance
The word "remuneration" appears twice in the Project Gutenberg lists of 2005-08-16: What gives? --Lambiam 13:37, 16 November 2013 (UTC)
 * Wiktionary:Frequency lists/PG/2005/08/20001-30000 (range 29901 - 30000)
 * Wiktionary:Frequency lists/PG/2005/08/90001-100000 (range 91401 - 91500)

Dutch frequency; doubts
As a native Dutch speaker (and as a human with sufficient intellect) I have to question the sources mentioned.

The novel 'Max Havelaar' hardly qualifies as a benchmark for modern-day Dutch. What practical purpose would using any single novel have, let alone one from 1860? Is any one writing style to be the median for a language?

The University of Leipzig, while undoubtedly more knowledgeable than I, doesn't clearly state their source at all. I find it very hard to believe that a word like "frank" appears in the top 100 most frequently used words in the Dutch language. It's relatively antiquated. I myself have never heard it used other than as a name. "1" hardly qualifies as a word to me, and "a" isn't even a word in Dutch, not to my knowledge anyway. Neither are "fr", "s" and "t". I have to seriously question both the validity and usefulness. OmikronWeapon (talk) 07:41, 9 July 2015 (UTC)

I believe the "s" comes from genitives like "Anna's boek" and the "t" from the abbreviated, unstressed "het" as in " 't huisje aan de Schelde". For more recent information on many languages, including Dutch, see Opensubtitles Word Frequency lists August 2016 - https://invokeit.wordpress.com/frequency-word-lists/ [but note that I'm not sure about the exact copyright/copyleft status of that info] Jansegers (talk) 15:38, 24 October 2016 (UTC)

Raw lists
Is there a way to download the raw lists or the source-code to make them? All I can see is the lists as wiki pages, which is not ideal for mechanical consumption. 207.179.110.10 02:24, 5 June 2016 (UTC)


 * They all come from different sources and were built from different data.--Prosfilaes (talk) 04:19, 6 June 2016 (UTC)