User:Connel MacKenzie/dictionary writing

I've had quite a run here on en.wiktionary.org so far. One of the more outstanding experiences is having had the Editor of NOAD actually reply to my e-mails (not form-letter style, but genuine, thoughtful responses, all). [Note: I mean the former Editor-in-Chief. Apparently, Oxford has since taken "North America" to mean a suburb of London and insisted that all editing of their dictionary be done by non-North Americans, presumably with "Rule Britannia" in bold before every definition. But I digress.]

The most poignant thing I heard in those comments was something I desperately did not want to hear: that, from a lexical perspective, almost nothing should be deleted from en.wiktionary.org. The "wiki-is-not-paper" concept was secondary; first and foremost was that en.wiktionary, as a dictionary, is too new to even guess at how people will use it. (Well, it was worded quite a bit differently in our conversations, but that was the most salient recommendation.)


 * I've pondered that a lot lately. It is a painful eye-opener.

I'm not sure I realized it, but I have a very specific idea of what I want en.wiktionary to be. Actually, I am certain that I never fully comprehended just how specific my notion of what a dictionary should be is. Most of that notion has so far been subconscious.

I think a dictionary:
 * Should have at least the 30,000 most common "root words" in the language (plus all their inflections, plurals, adjectives, adverbs, comparatives and superlatives.)
 * Should have no words that are not in the 600,000 most common terms in the language
 * Should have only sanitized entries for "freak" terms
 * Should redirect misspellings to proper spellings (soft or hard links) whenever possible by software
 * Can have specialty terms, but only if they can be isolated
 * Should have no foreign language entries (but translations of English words, as an option)
 * Should have appendix entries only for language-related subjects
 * Should emphasize definitions above all else
 * Should have audio pronunciation files (only) for all terms
 * Should have no multi-word entries
 * Should not have names (trademark, surname, given name, etc.)
 * Should not list gerunds
 * Should not be British "replace the word with this definition" style, but explanatory sentence style, instead, covering as many parts of speech per definition as possible.
 * Can have etymological footnotes, but only when expounding the definition. (I.e., no formulaic cruft.)
 * Could have a mode of restricting searches to "core spellings" only. (I.e. a driver for a spell-check engine.)

The biggest barrier to Wiktionary's usefulness that I see is that we currently have no way of segregating "specialties" (such as foreign languages) in a meaningful way. All lookups currently get shit-flooded with BDSM, leet, Spanish and entertainment definitions. Worse still is that there is no way to add a "specialty" vertical segment (say, Pascal programming jargon, pulmonary flow jargon, soap opera catch-phrases, cancer research jargon, place names in Swaziland, comedy one-liners, botany jargon or legal jargon) without each of those items (useful as a collection in their own right) having to pass WT:CFI individually. That is, it should be possible to add a collection of "MUMPS programming terms" as a segment that doesn't appear in a general search by default. In other words, a default search should not search all entries, but only English "core" terms.
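To make the idea concrete, here is a minimal sketch of segment-aware lookup. It is only an illustration under stated assumptions, not how MediaWiki or any existing Wiktionary tool works: the `Entry` type, the `segment` field, the `"core"` label and the sample entries are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical label for general English entries; not an existing Wiktionary concept.
CORE = "core"


@dataclass(frozen=True)
class Entry:
    headword: str
    definition: str
    segment: str = CORE  # hypothetical: which vertical segment the entry belongs to


def search(entries, term, segments=(CORE,)):
    """Return entries matching `term`, limited to the requested segments.

    By default only the "core" segment is searched, so specialty
    collections stay out of the way unless explicitly asked for.
    """
    wanted = set(segments)
    term = term.lower()
    return [
        e for e in entries
        if e.headword.lower() == term and e.segment in wanted
    ]


# Illustrative sample data only.
ENTRIES = [
    Entry("global", "Concerning all parts of the world."),
    Entry("global", "A persistent variable stored in the database.", segment="mumps"),
]

default_hits = search(ENTRIES, "global")
opt_in_hits = search(ENTRIES, "global", segments=(CORE, "mumps"))
```

With this arrangement a general lookup of "global" sees only the core sense, while a reader who opts in to the hypothetical "mumps" segment sees both. A collection could then be added as a whole segment without each term competing for space in the default results.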

Likewise, no method exists (today) for shunting nonsense off to the side somewhere. How many times do certain vulgarisms have to be re-entered?

If it seems like I'm deleting a lot less lately, I assure you, it is not your imagination. It is because of these conversations over the last several weeks. I don't think I've ever claimed to be a lexicographer (not even as an amateur), but I'm amazed at how much I am still learning today about the difficulties of writing a dictionary from scratch.

Our current policies focus on using tools such as Google or AltaVista to include every imaginable typo. But the result is an unusable mess with an astonishingly low signal-to-noise ratio. With a dozen programmers working toward a common goal, Wiktionary can become useful in a year or two. With the current anarchy, where trolls rule the day, the same results may take ten to twenty years to achieve.

--Connel MacKenzie 06:43, 6 March 2007 (UTC)

May 2007
I have received some wonderful feedback on this page, but would like to clarify a few things that I guess were ambiguous above.

I do not think a "freak" definition should be removed, per se. But I insist that such a thing should be sequestered off in a manner that limits it only to those looking for it. In en.wiktionary.org history, this usually has meant that some strange S&M figurative use of a real word is given a "new" meaning. The problem is that the "normal" English word often has not been entered first. Look up hummer here. This is a very (VERY) mild example of what I'm talking about. The musical act of making a tune with one's lips closed isn't even mentioned, yet the colorful, figurative fellatio use is. That is absurd. The fellatio meaning certainly is not more common than the musical act, in writing or in spoken English. Yet it is the one people would likely look up in disbelief. (Well, disbelief in elementary or junior high school, I suppose.)

So, should the "fellatio" meaning be listed? Of course. Should it appear as the default search result? Absolutely not. We should have links to sub-sections (probably hidden by default) that indicate there is an additional definition available in the "sex slang" topic subsection.

Now, I also said "sanitized" but, because that is ambiguous, it was misinterpreted. The entry for fuck here on en.wiktionary.org is pretty much OK. It is a sensitive term, likely to cause offense, and is therefore dealt with in a "sanitized" professional manner. Unfortunately, of all the vulgar terms we have, that is probably the only entry that has been exhaustively reviewed. Most of the rest that we have are written in the most puerile and offensive tone possible. No systematic approach to cleaning them has been taken to date. This makes it hard to objectively consider whether they should be sequestered or not.

I would also like to address the misinterpretation of my closing paragraphs, above. I was not lamenting that these features are not technically feasible. Hippietrail's proof of concept on http://wiktionarydev.leuksman.com/ shows that these features are easy to implement. My concern is that the en.wiktionary.org community won't adopt these features in a timely manner. (Whether they like it or not, they'll have these features eventually; whether it comes from me, Hippietrail, Brion, Tim, Rodasmith, Scs or someone else is irrelevant. It is inevitable.)

Thank you for the various blog feedback.

--Connel MacKenzie 19:36, 13 May 2007 (UTC)