Wiktionary talk:Frequency lists/Dutch wordlist

Numbers probably inaccurate
Some of the editors have stated that they adjusted the counts; it is not clear whether the others did so, so the counts may not be accurate. The edits have removed many entries, so I adjusted the "number of words" accordingly. 伟思礼 (talk) 04:52, 16 October 2020 (UTC)

Cleanup

 * "Lk" is not a Dutch word. The first letter is in fact a capitalized "i". → Added its frequency number to the number of "ik" (first-person singular).
 * Same for the non-existing word "lemand" → "iemand".
 * If the first letter of "één" is capitalized, it's written "Eén" (without the first accent). → Added the frequency numbers of "eén" and "één", and moved the word to its new, higher position in the list.
 * "Oke" is a misspelling of "oké". → Added the frequency numbers of "oke" and "oké", and moved the word to its new, higher position in the list.
 * "Dr" is not a word, but an abbreviation of "doctor". → Added a dot, "dr."
 * Changed "mw" (abbreviation for "maatschappelijk werk" ("social work")) to the more likely "mw." ("Mrs").
 * "John(ny)", "Harry", "Charlie", "Donald", "Diane", "Flynn", "Jane", "Claire", "Michael", "Mike", "Max", "Will(y)", "Christian", "Christopher", "Alex(ander)", "Ray(mond)", "Bill(y)", "Sarah", "Jim(my)", "Phil", "Vanessa", "Amy", "George", "David", "Ruth", "Charles", "Christine", "Julian", "Jordan", "Eddie", "Paul(ie)", "Tina", "Sam", "San" and "York" are (parts of) names. → Removed from list.
 * "Sir", "Mr", "Ms" and "Mrs" are English words. Dutch subtitlers tend to not translate these forms of address. → Removed from list. (The same goes for "Miss", but since this word is used in Dutch for winners of beauty contests as well, I'm not sure if I should remove it.)
 * "The", "dude" and "new" are English words. → Removed from list.
 * "Peter" ("godfather"), "mark" ("mark"), "chris" (a kind of gymnastic skill), "daisy" (a kind of biscuit) and "joe" ("yah", "yup") are in fact Dutch words. But they're fairly unusual and therefore there's no doubt in my mind that these are actually names. → Removed from list.
 * Capitalized "Engeland", "Londen", "Parijs", "Mexico", "Amerika", "Amerikanen", "Amerikaan", "Amerikaans", "Amerikaanse", "Nederlandse", "Duitsland", "Duits", "Frankrijk", "Spanje", "CIA" and "FBI".
 * This is all the time I have. --Caudex Rax ツ (talk) 16:27, 26 July 2014 (UTC)
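
The merging steps in the cleanup list above (adding the count of a misspelled or miscapitalised form to the count of the correct form, then moving the word to its new position) could be sketched roughly as follows. This is only an illustration, not the procedure any editor actually used: the function name `merge_variants`, the `corrections` mapping and all counts in the example are made up.

```python
from collections import Counter


def merge_variants(freq, corrections):
    """Fold the count of each wrong form into its corrected form and re-rank.

    freq: dict mapping word -> count (hypothetical input format).
    corrections: dict mapping wrong form -> corrected form.
    Returns a list of (word, count) pairs, highest count first.
    """
    merged = Counter()
    for word, count in freq.items():
        # Words without a correction are kept under their own spelling.
        merged[corrections.get(word, word)] += count
    # Sort by descending count; ties are broken alphabetically.
    return sorted(merged.items(), key=lambda item: (-item[1], item[0]))


# Made-up counts, purely for illustration.
example_freq = {
    "ik": 1000, "lk": 40,
    "oké": 300, "oke": 350,
    "iemand": 200, "lemand": 5,
}
example_corrections = {"lk": "ik", "lemand": "iemand", "oke": "oké"}
ranked = merge_variants(example_freq, example_corrections)
```

With these invented numbers, "oké" ends up above "iemand" only after its two spellings are combined, which is the "moved to its new, higher position" effect described above.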


 * Although is technically the plural of  (twee ais = two exclamations of disappointment or pain) and  is A♯, I'm pretty confident that in the subtitles it originally was aIs:  with a capital I instead of a lowercase l.
 * ‘Ian’, ‘Abby’, ‘Jean’, ‘Julia’, ‘Stark’, ‘Rita’, ‘Bella’, ‘Bell’, ‘El’, ‘Ellen’, ‘Ed’, ‘Ron’, ‘Blake’, ‘Zack’, ‘Dick’, ‘Mickey’, ‘Mikey’, ‘Ted’, ‘Lee’, ‘Stan’, ‘Ford’, ‘Stone’, ‘Evan’, ‘Kane’, ‘Zoe’, ‘Luke’, ‘Dale’, ‘Kate’, ‘Julie’, ‘Nate’, ‘Tim’, ‘Francis’, ‘Franklin’, ‘Frankie’, ‘Anna’, ‘Anne’, ‘Annie’, ‘Ann’, ‘Vic’, ‘Eve’, ‘Fred’, ‘Kong’, ‘Lane’, ‘Grace’, ‘Green’, ‘Grand’, ‘Booth’, ‘Seth’, ‘Ali’, ‘Lois’, ‘Mia’, ‘Kim’, ‘Walt’, ‘Walter’, ‘Alan’, ‘Lex’, ‘Eli’, ‘Pam’, ‘Homer’, ‘Brad’, ‘Steve’, ‘Leo’, ‘Leon’, ‘Leonard’, ‘Marie’, ‘Vince’, ‘Vincent’, ‘Hall’, ‘Carl’, ‘Carol’, ‘Pearl’, ‘River’, ‘Omar’, ‘Rex’, ‘Andrew’, ‘Anderson’, ‘Andy’, ‘Sandy’, ‘Randy’, ‘Miranda’, ‘Amanda’, ‘Tyler’, ‘Miller’, ‘Tom’, ‘Thomas’, ‘Daniël’, ‘Dana’, ‘Bart’, ‘Albert’, ‘Robert’, ‘Martha’, ‘Arthur’, ‘Chuck’, ‘Dave’, ‘Eric’, ‘Henry’, ‘Jake’, ‘Jerry’, ‘Jones’, ‘Kevin’, ‘Martin’, ‘Marty’, ‘Mary’, ‘Scott’, ‘William’, ‘Williams’, ‘Rachel’, ‘Bobby’, ‘Roy’, ‘Troy’, ‘Doug’, ‘Brown’, ‘Jay’, ‘Tony’, ‘Anton’, ‘Nick’, ‘Nicky’, ‘Nicole’, ‘Rock’, ‘Rocky’, ‘House’, ‘Rick’, ‘Ricky’, ‘Patrick’, ‘Richard’, ‘Richie’, ‘Rico’, ‘Samantha’, ‘Carter’, ‘Jackson’, ‘Hank’, ‘Cameron’, ‘Mitchell’, ‘Jack’, ‘O'Neill’, ‘Jonas’, ‘Quinn’, ‘Teal'c’, ‘Jennifer’, ‘Rodney’, ‘Elizabeth’, ‘Chloe’, ‘Nicholas’, ‘Matthew’, ‘Wallace’, ‘Young’, ‘Jackie’, ‘Jacob’, ‘Jamie’, ‘Michelle’, ‘Neil’, ‘Jonathan’, ‘Joey’, ‘Johnson’, ‘Jenny’, ‘Lisa’, ‘Matt’, ‘Ryan’, ‘Larry’, ‘Emily’, ‘Lucy’, ‘Kelly’, ‘Kyle’, ‘Taylor’, ‘Barry’, ‘Terry’, ‘Wayne’, ‘Gary’, ‘Lily’, ‘Sally’, ‘Harvey’, ‘Molly’, ‘Jersey’, ‘Teddy’, ‘Penny’, ‘Betty’, ‘Nancy’, ‘Sammy’, ‘Ashley’, ‘Fox’, ‘Dexter’, ‘Dylan’, ‘Kenny’, ‘Sonny’, ‘Freddy’, ‘Tracy’, ‘Anthony’, ‘Wendy’, ‘Barney’, ‘Maya’, ‘Stanley’, ‘Casey’, ‘Jeremy’, ‘Lenny’, ‘Benny’, ‘Manny’, ‘Percy’, ‘Murphy’, ‘Cindy’, ‘Clay’, ‘Audrey’, ‘Riley’, ‘Scully’, ‘Peyton’, ‘Haley’, ‘Cody’, ‘Phoenix’, ‘Felix’, ‘Guy’, ‘Jeffrey’, ‘Wesley’, ‘Scylla’, ‘Kennedy’, ‘Brian’, ‘Lawrence’, ‘Lucas’, ‘Sara’, ‘Margaret’, ‘Maria’, ‘Liz’, ‘Hannah’, ‘Samuel’, ‘Barnes’, ‘Elena’, ‘Benjamin’, ‘Norman’, ‘Gus’, ‘Allison’, ‘Nelson’, ‘Ross’, ‘Barbara’, ‘Catherine’, ‘Katherine’, ‘Katie’, ‘Laura’, ‘Lauren’, ‘Brooke’, ‘Smith’, ‘White’, ‘Buddy Holly’, ‘Beverly’, ‘Bailey’, ‘Bud’, ‘Jeff’, ‘Noah’, ‘Shaw’, ‘Bob’, ‘Bruce’, ‘Chase’, ‘Heather’, ‘Harold’, ‘Harris’, ‘Hector’, ‘Howard’, ‘Beth’, ‘Joseph’, ‘Josh’, ‘Keith’, ‘Mitch’, ‘Ralph’, ‘Carlos’, ‘Caroline’, ‘Carrie’, ‘Karen’, ‘Oscar’, ‘Picard’, ‘Susan’, ‘Sue’, ‘Bauer’, ‘Eva’, ‘Emma’, ‘Louis’, ‘Lou’, ‘Lewis’, ‘Owen’ and ‘Baker’ are names and, going by Caudex's standard, don't belong on the list.
 * Jeans are (tantum plurale) or.
 * The (plural ) exists as an old unit of measurement, but it's nearly extinct. No way it would rank higher than . Also, ‘El’ is used as a part of names, Spanish ones in particular.
 * I think the abbreviation would have gotten split into e and d.
 * ‘Mickey’ could be part of (probably entered Dutch in the 20th C., always used in full) but I think it wouldn't make the top 5k.
 * I'm pretty sure ‘Ford’ doesn't refer to the car brand, since other car brands, some of which are much more common, aren't represented in the list.
 * A is also an old British unit of measurement, but I wouldn't expect it to make the top 5k.
 * Technically the verb form exists, but it's a mostly pre-1950 suffix on a probably post-1950 verb. Similarly,  exists, but I cannot imagine it being common in subtitles.
 * A is a small franc coin, I don't consider that likely to make the list.
 * The word meaning  is rare. It could also be a golf term, but the absence of  and other golf terms makes this unlikely in my opinion.
 * A grand is called a in Dutch.
 * The word means horizon, but I wouldn't expect it to make this list.
 * The word means ‘is on a rolling boil’ but that certainly wouldn't be in the top 5k.
 * There is the combination, but ‘bain’ didn't make the list, so yeah.
 * The word also occurs in the combinations  and  but ‘fame’ and ‘shame’ aren't on the list, or in the sense of the meeting room in a hotel, but that's even rarer. I think here it's either a surname or part of the name of a locality, building, room, &c., such as the Hall of the Dutch Senate or Albert Hall.
 * The word (Christmas song) is really rare and I wouldn't expect to see it here.
 * There's actually a Pearl River, but I don't think combining these gives a sensible ranking. I expected this to be part of Pearl Harbor, but even ‘harbor’ isn't on the list, so I suspect most occurrences are just personal names.
 * An is someone who is hard to convince of things or someone who is in the eyes of a Christian a faithless person. A  is a British soldier.
 * The rare word is used for a variety of wildly different kinds of bread and confectionery containing almonds.
 * A British constable is called a, but I imagine the name will overshadow it here.
 * A is an almost forgotten word for a narrow wooden beam.
 * The word is really rare, although it has recently entered Belgium's societal energy discussion. Apart from that though, you'll likely never encounter it.
 * A is also someone's handle on the internet.
 * Although is a musical genre, other common genres are absent, except for  which I suspect is mostly used as a name, or possibly as part of a toponym, here. If it were used as a musical genre you'd also expect ‘roll’ to show up in the list as part of the word.
 * The nowadays rare word is a slur for a very rich person.
 * A is a jacket though. I didn't really know what to do, so in the end I decided to deduct about 2k from the frequency to try to account for the name ‘Jack’, but this is a really rough guesstimate. Why o why did the author remove all caps? Yeah, I know he gives a reason, but the reason sucks.
 * The word is also a form of the verb.
 * A phoenix is a.
 * A is also a kind of knitted sweater, but coordinating terms aren't on the list and the word is rare.
 * A teddy is a.
 * A is also a coin, but it's rare and if used as such the plural is  and this isn't on the list even though it would be much more likely to show up.
 * The is the designated driver, and bob is also a term for a hairdo. But the name is far more common.
 * An is of course also an Academy Award. But usually it's a name.
 * An or  is also a kind of short little apron.
 * The rare word is street slang for no or not, and  are two languages with a combined total of about 1½k speakers.
 * A is a midwife's assistant or a form of the verb, to dry-nurse or to swaddle. Yeah, it's rare and antiquated. And I've got a hunch that a lot of mentions are actually part of ‘221B Baker Street’, which should be removed for the same reason we've removed all those names.
 * Although is a non-standard plural of the letter  (it's in the Green Booklet, but Language Phone states only z has two plurals) and  means, I think this is really just ls:  with a lowercase l instead of a capital I.
 * Mn is a misspelling of.
 * , and  are abbreviations.
 * ‘com’ is not a Dutch word; its primary use is in URLs.
 * ‘By’, ‘bay’, ‘lake’, ‘east’, ‘south’, ‘hill’ and ‘street’ are English words, presumably mainly mentioned as parts of toponyms. ‘And’ is also an English word.
 * he is a misspelling of . (Note 1: the word does exist as an antonym of  but it's really rare, certainly not among the top 5k. Note 2: I think it's unlikely that he is a misspelling of  because that would run contrary to common pronunciation rules. People would use heh if they cannot use è.) Hee is an alternative spelling of hé. Hey is a misspelling of hé.
 * Although ‘mac’ could be part of ‘McDonald's’ or Macintosh, neither similar brands like ‘Febo’, ‘Burger King’, ‘Whopper’, ‘Windows’ or ‘Microsoft’ nor the word  are on the list, so I think it's more likely just a part of a character name in almost all instances.
 * Although a is a defender, the absence of,  and  means that this is an English word, probably used as the surname ‘Back’.
 * ‘von’ is a common part of German names, not a Dutch word. ‘da’ is a common part of Romance names, not a Dutch word.
 * Although ‘on’ can be used as parts of phrases like, by itself it isn't a Dutch word. Same goes for ‘all’ in e.g. , and ‘to’ in e.g. . Same for French ‘le’ in e.g..
 * ‘for’ is English, I don't know how it ended up ranking this high in the subtitles. ‘Simply’ is also English; I think it's used a lot in names for shows. ‘Day’ is used mostly as part of, and the like, but separately they wouldn't have made this list.
 * ‘boo’ is puzzling. If it were a misspelling of, you'd expect that to be on the list as well. I think it's probably the name ‘Boo’.
 * Em is a misspelling of . It could also be an abbreviation, but I don't consider that likely. What is worrying though is that  should have been on the list. I strongly suspect the original creator of the list filtered out all single letters. If it hadn't been filtered out, it might have had a frequency value of 200k, maybe even 600k or more... which would give it a ballpark ranking of 130 to 50 or so. Compared to that, the contribution of ‘em’ is not significant; we can only accept that a very common word is missing from this list.
 * Uh and euh are misspellings of.
 * Capitalised, , , , , , , , , , , , , , , , , , , , , , , , , and.
 * 5829 ‘San’'s are still unaccounted for, assuming Francisco as a personal name is rare enough not to make the list.
 * 8058 ‘New’'s are still unaccounted for, assuming the city York doesn't occur often enough to make the list.
 * The abbreviation exists, but it's somewhat rare in comparison and I would expect to find it less in subtitles.
 * is a form of . And I'm fuzzy on this, but I also think in tennis a service is a let if the ball touched the net or if the opponent wasn't ready. Not a common term outside of tennis though.
 * An is a forgotten unit of measure of about 1½ gram.
 * I didn't capitalise (Chinese food, Chinese restaurant) and  (to eat Chinese / to chase the dragon) because I've got a hunch that the lower case meanings are more common.
 * There are other named beaches, such as the ones in Normandy, but Miami Beach is the most famous by far.
 * I expect to be a suffix occurring in combinations like 442ste regiment.
 * Since ‘versa’ is missing, is probably a prefix.
 * I think in subtitles ‘no’ is almost always used in the combination, rather than as the abbreviation
 * Although is a word, it's ultra-rare. I'm sure this was  before the original uploader ‘cleaned’ the list. Compare other drinks in the list and their frequencies.
 * ‘hmm’ is a misspelling of.
 * It's tempting to look at ‘Miss’ and ‘World’ and create a ‘Miss World’ entry, but honestly I don't think it would rank that high. I think it's more likely that ‘Miss’ is part of character names like ‘Miss Marple’ and that ‘World’ ranks so high mostly because it's part of so many film titles. If you look at the frequency of ‘Frodo’ (2720) you can see how even a single series can skew the statistics. I'm removing both entries.
 * ln is a misspelling of with a lowercase l instead of a capital I.
 * ‘nlondertitels’ is a subtitle website. ‘Suurtje’ is a prolific subber.
 * Although is a verb form, I think the creator of this list split all hyphenated words, which means that this was part of the word  before the butchering. The word  also exists, but it's much rarer.
 * Perhaps somewhat optimistically combined ‘earl’ and ‘grey’ into . It's possible both terms are sometimes used in names, but I'd expect it to be tea fairly often. Please compare its ranking with tea and other drinks and decide for yourself whether this makes some sense or not. In any case, this still leaves 305 ‘earl’s unaccounted for.
 * Similarly, I merged, leaving 1674 ‘all’s unaccounted for. Yes, I know ‘one’ is also used in other combinations, but even for the most common ones the other parts didn't show up on this list.
 * Although (usually misspelled as ) could mean  (toxic equivalent), I think it's more likely it's a misspelling of  or part of the name of a start-up that considers itself too hip to use the spelling ‘tech’.
 * Fixed the fossilised forms, , and.
 * Because in films and on television the words and  are mostly used as curses and only rarely in the religious sense, I've removed the capitals. There's an ongoing movement in Dutch orthography to use capitals only sparingly, e.g.  is no longer capitalised, nor  in the Bible, nor personal names used as archetypes, just to name a few examples.
 * ‘se’ is part of . (Actually, there are some more combinations with se but they are rare in subtitles.)
 * I'm not sure what to do with ‘you’. It's clearly not a Dutch word and only used in English phrases, probably almost exclusively in the curse . Yet ‘fuck’ (6109) has fewer occurrences than ‘you’ (6554). Considering that by my sloppy estimation, ‘fuck’ is followed by ‘you’ in ¾ of all cases, that leaves about 2k ‘you’s unaccounted for.
 * and are alternative spellings of the same word. I don't think it's settled yet what the official spelling will end up being. Since it's an English loanword, let's spell it as such for now. By the way, jo is the vocative form of  in Surinam Dutch.
 * Okay is a misspelling of . Whoa is a misspelling of, not to be confused with which means ‘wow’.
 * ‘CTU’ is a fictional agency, among other things.
 * ‘Blue’ can be used in several combinations, but the most common one by far is . This still leaves 30939 ‘the’s unaccounted for.
 * is an alternative spelling alongside.
 * Although is an element,  is a double initial and  means ‘for rent’, I think we're dealing with the English ordinal suffix ‘-th’ here, which is common in film titles and addresses.
 * is a misspelling of . Interestingly, still has an x.
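
The "unaccounted for" figures in the list above are simple subtractions. As a rough sketch, here is the estimate for ‘you’ worked out from the two counts quoted above (6554 and 6109) and the admittedly sloppy ¾ share. The function name `unaccounted` is made up for this illustration and is not part of any tool used on the list.

```python
def unaccounted(count_word, count_partner, share):
    """Estimate how many occurrences of a word remain unexplained.

    count_word: frequency of the word under scrutiny (here 'you').
    count_partner: frequency of the word it is assumed to follow (here 'fuck').
    share: estimated fraction of count_partner that is followed by the word.
    """
    return count_word - share * count_partner


# Counts as quoted in the list; the 0.75 share is the rough estimate given there.
leftover_you = unaccounted(6554, 6109, 0.75)
```

This gives 6554 − 0.75 × 6109 = 1972.25, which matches the "about 2k" mentioned above. The result is only as good as the guessed share.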