Wiktionary:Votes/2019-03/Excluding typos and scannos

Excluding typos
Voting on: In WT:CFI, making the following change, where crossed out words should be removed and underscored words should be added:

Misspellings, common misspellings and variant spellings : Rare misspellings should be excluded while common misspellings should be included. <u>Typos and scannos (which do not result from ignorance on the part of the author, but are “accidental” misspellings due to a slip of the finger or an OCR bug, for example) should be excluded.</u> There is no simple hard and fast rule, particularly in English, for determining whether a particular spelling is “correct”. Published grammars and style guides can be useful in that regard, as can statistics concerning the prevalence of various forms.

Most <s>simple typos</s> <u>misspellings</u> are much rarer than the most frequent spellings. Some words, however, are frequently misspelled. For example, occurred is often spelled with only one c or only one r, but only occurred is considered correct.

It is important to remember that most languages, including English, do not have an academy to establish rules of usage, and thus may be prone to uncertain spellings. This problem is less frequent, though not unknown, in languages such as Spanish where spelling may have legal support in some countries.

Regional or historical variations are not misspellings. For example, there are well-known differences between British and American spelling. A spelling considered incorrect in one region may not occur at all in another, and may even dominate in yet another.

Combining characters (like ́|this) should exist as main-namespace redirects to their non-combining forms (like ´|this) if the latter exist.

Schedule:
 * Vote starts: 00:00, 7 April 2019 (UTC)
 * Vote ends: 23:59, 6 May 2019 (UTC)
 * Vote created: Chignon – Пучок 18:23, 29 March 2019 (UTC)

Discussion:
 * Stricken words? DonnanZ (talk) 12:04, 2 April 2019 (UTC)
 * Please correct as you see fit. "struck"? "struck through"? "crossed out"? Chignon – Пучок 15:55, 2 April 2019 (UTC)
 * I assume you mean words that have been struck through, isn't used for that. DonnanZ (talk) 16:07, 2 April 2019 (UTC)
 * Yes, fixed. Thanks. Chignon – Пучок 20:28, 2 April 2019 (UTC)
 * The English Wiktionary votes usually do not have a "Discussion:" section directly on the vote page; the discussion about wording usually takes place on the vote talk page. I had an urge to remove the above, but decided to keep it for the record. --Dan Polansky (talk) 07:32, 7 April 2019 (UTC)
 * Later: Actually, votes do have a discussion section, containing links to discussions, usually at least the link to the vote talk page, in this case Wiktionary talk:Votes/2019-03/Excluding typos and scannos. Discussion does not take place in that section. --Dan Polansky (talk) 08:11, 7 April 2019 (UTC)

Support

 * 1)   --Lambiam 07:08, 7 April 2019 (UTC)
 * @Lambiam: Can I ask for a rationale for including common non-typo misspellings while at the same time excluding common typos? Or can I ask for a location where I can find that rationale? --Dan Polansky (talk) 08:13, 7 April 2019 (UTC)
 * For a rationale for including common non-typo misspellings, see Wiktionary talk:Votes/pl-2014-04/Keeping common misspellings. Like the author of a misspelling, readers may actually believe this is the correct spelling (like for, reinforced by  and ) and think it is an omission if not included. For typos, on the other hand, often caused by fat fingers, the reader will generally recognize this as a typo. For example, a GBS for accdentally gets 96 results, accdientally gets 84 results, accicentally 168, and so on and so forth. If we include all attestable common typos, they will overwhelm the body of correctly spelled terms. If the inability of readers to find entries for typos is really an issue, we should offer suggestions like Google does: Did you mean: "accidentally"?  --Lambiam 11:51, 7 April 2019 (UTC)
 * @Lambiam: But we only include common attested misspellings, not all attested misspellings; in WT:CFI, we explicitly say this: "Rare misspellings should be excluded". shows that *accdientally is so rare that the frequency ratio cannot be determined, and that would lead to exclusion by my lights, as a rare misspelling. The same is true of *accicentally: not found in Google Ngram Viewer => no frequency ratio determined => excluded. By contrast, accross looks like this: . --Dan Polansky (talk) 12:20, 7 April 2019 (UTC)
 * Well, I think accross is not simply a typo. This misspelling arises under the influence of correct spellings like, and . The consistent use, like here and here, is a give-away that the respective authors believe this to be the correct spelling.  --Lambiam 13:05, 7 April 2019 (UTC)
 * I agree that accross is not a simple typo but rather a misspelling that one would probably produce in writing as well. My point with accross was to show how frequency ratio determination works, and how it serves to eliminate most rare typos such as those you mentioned. ---Dan Polansky (talk) 14:40, 7 April 2019 (UTC)
 * 1) . The first definition line at teh is a prime example of something that is common enough, but that nobody would ever look up, because it's an obvious typo. —Μετάknowledge discuss/deeds 18:36, 7 April 2019 (UTC)
 * @Metaknowledge: teh is a very rare misspelling by the frequency ratio standard: . Therefore, it can be excluded without change to policy. --Dan Polansky (talk) 17:23, 13 April 2019 (UTC)
 * Your metric is logically flawed. Your example accounts for roughly 5% of all words in written English, so even if it is misspelled one time in a million it occurs frequently.  -- which represents about 0.0004% of words in written English -- could be misspelled one time in a hundred and occur more rarely. The question should not be how often the word is misspelled, but how often the misspelling occurs. -  TheDaveRoss  12:25, 19 April 2019 (UTC)
 * The metric is not logically flawed. You may think using absolute frequency more appropriate (I don't), but that has nothing to do with logic AKA study of correct inference. Above all, absolute frequency will not help tell whether the form is a misspelling; only frequency ratio can do that. To wit, teh has higher absolute frequency than various "correct" spellings, e.g. . --Dan Polansky (talk) 13:49, 6 May 2019 (UTC)
 * 1) . Getting rid of unintentional errors should be a no-brainer, especially with respect to scannos. Whether these have any lexical content is in my view dubious at best and removing them would improve dictionary quality and user experience. ←₰-→  Lingo Bingo Dingo (talk)  10:49, 8 April 2019 (UTC)
 * @Lingo Bingo Dingo: I would say scannos are excluded anyway as long as we have access to the original. By looking at the original scanned image, we conclude that the scanno was in fact not attested there. --Dan Polansky (talk) 17:25, 13 April 2019 (UTC)
 * 1) . Look through the categories and you can see some really ridiculous fat-finger errors that the typist couldn't possibly have intended. Equinox ◑ 17:06, 8 April 2019 (UTC)
 * 2)  for the reasons given above. — SGconlaw (talk) 03:17, 10 April 2019 (UTC)
 * , including non-words reduces the value of this project immensely. - TheDaveRoss  15:05, 11 April 2019 (UTC)
 * @TheDaveRoss: I don't understand the above; would you be inclined to clarify? Like, reduce the value a little I could understand, from a certain standpoint different from mine, but immensely? Even those who want to build a spell checker from Wiktionary can remove what we have marked as misspellings in a fully automated fashion. --Dan Polansky (talk) 17:49, 13 April 2019 (UTC)
 * , have you tried to sort the actual words from non-words in Wiktionary? I have; it is obnoxiously hard, despite the fact that I am very familiar with the project. Due to the limitations of the wiki structure I think that the most value to be derived from Wiktionary is by those who build wrappers or use the data in other ways, so adding to the difficulty of doing so is more of a problem than having poor formatting. The problem that the inclusion of typos and misspellings is attempting to solve has also already been solved, and better, by search engines; we should implement a software solution rather than adding a bunch of garbage data. For every correct spelling there are an infinite number of incorrect spellings; we cannot actually do a good job of implementing the method we are attempting, so why not instead implement something with fewer issues and an actual chance of success? If people look up misspellings in a dictionary and find them listed they are going to lose faith in the veracity of the dictionary. There is nothing worthy about including these terms. - TheDaveRoss  00:07, 15 April 2019 (UTC)
 * @TheDaveRoss: 1) I have not. What makes it "obnoxiously hard"? This vote concerns entries that have senses marked with misspelling. What makes it hard to identify these senses automatically? I don't follow. It seems pretty straightforward; I did some dump processing myself. For the .tsv that used to be published, which has one sense per line, it is super easy: grep -v "{{misspelling" enwikt-defs-20120821-en.tsv. 2) As for "If people look up misspellings in a dictionary and find them listed they are going to lose faith in the veracity of the dictionary": I don't see why people's finding misspellings marked as "misspellings" would reduce their belief that the Wiktionary tends to be accurate or "lose faith in the veracity". When I find a source reporting a fact accurately, that does not reduce my trust in accuracy. 3) As for "For every correct spelling there are an infinite number of incorrect spellings": Obviously both untrue and irrelevant: rather, for every widely used spelling, there is a very limited (often zero) number of misspellings that are both attested and common, where common means more than attested in 3 instances of use. --Dan Polansky (talk) 06:50, 19 April 2019 (UTC)
 * One problem is that it requires parsing the full text, including knowledge of which templates signify what status of a word. Perhaps we are willing to require that much of users, I would not. Another problem is that, while the lemma may contain the misspelling template, the forms of the misspelling do not. So that means for all form-of entries you must also parse the lemma to find out whether or not the lemma is a misspelling. Further down that rabbit hole, what if the lemma is sometimes spelled correctly, but the form-of entry is for a misspelling? It might be a part of speech question, which could be correctly parsed in an automated (albeit tricky) manner, but what if the scenario is that the lemma is uncountable, but the form-of is a plural of a misspelled countable noun? There are tons of scenarios which have to be accounted for, and they change frequently. All for what benefit? As I have said before this is a bad solution to a problem which has been better solved in other ways. - TheDaveRoss  13:07, 25 April 2019 (UTC)
 * 1) . It does occur to me that a user encountering a typo or scanno who has no familiarity with the intended word may attempt to search for it in the erroneous form. On balance, though, I agree with the arguments put forward in favour of this proposal. Aabull2016 (talk) 15:24, 13 April 2019 (UTC)
 * 2) . - -sche (discuss) 02:02, 14 April 2019 (UTC)
 * 3)  the principle that accidental typos/scannos should be excluded. Support inclusion of common misspellings that people use believing them to be correct. Mihia (talk)
 * 4)  as vote creator. Chignon – Пучок 08:06, 15 April 2019 (UTC)
 * 5)  DTLHS (talk) 19:19, 17 April 2019 (UTC)
 * I ultimately agree that 'It does occur to me that a user encountering a typo or scanno who has no familiarity with the intended word may attempt to search for it in the erroneous form,' as above. However, this is more likely to occur with rarer words, so including such forms is largely fruitless, given that they won't constitute 'common misspellings' and will simply get in the way of entries. If you know how it is spelt, you will know to retype it. LingNerd007 (talk) 23:18, 25 April 2019 (UTC)
 * Ineligible to vote. —Μετάknowledge discuss/deeds 02:34, 26 April 2019 (UTC)
 * Ah, in that case, sorry for spamming up the page. LingNerd007 (talk) 07:28, 27 April 2019 (UTC)
 * 1) Support. Scannos in particular can often be avoided if the original text can be examined. DonnanZ (talk) 09:10, 4 May 2019 (UTC)
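
The dump-filtering step debated in the thread above (dropping senses marked with the misspelling template, and then also dropping form-of entries whose lemma is a misspelling) can be sketched roughly as follows. This is only an illustration: the four-column tab-separated layout, the sample rows, the helper names, and the way form-of templates are matched are all assumptions made for the example, not a description of the actual dump format or of any existing tool.

```python
import re

# Hypothetical sample rows: language, title, part of speech, definition.
# The real dump layout may differ; this is an assumption for illustration.
SAMPLE_TSV = "\n".join([
    "English\tconceive\tVerb\t# To develop an idea.",
    "English\tconcieve\tVerb\t# {{misspelling of|en|conceive}}",
    "English\tconcieves\tVerb\t# {{third-person singular of|en|concieve}}",
    "English\tconceives\tVerb\t# {{third-person singular of|en|conceive}}",
])

# Matches a sense line that uses the misspelling template.
MISSPELLING = re.compile(r"\{\{misspelling of\|")
# Matches a generic "{{... of|lang|lemma" template and captures the lemma.
FORM_OF = re.compile(r"\{\{[^{}|]* of\|[^{}|]*\|([^{}|]+)")


def parse_rows(tsv):
    """Split the TSV text into (language, title, pos, definition) tuples."""
    return [tuple(line.split("\t")) for line in tsv.splitlines() if line]


def filter_misspellings(rows):
    """Drop misspelling senses and form-of senses that point at misspellings."""
    # Pass 1: collect every title that has at least one misspelling sense.
    misspelled = {title for _, title, _, d in rows if MISSPELLING.search(d)}
    kept = []
    # Pass 2: keep only senses that are neither misspellings nor forms of one.
    for row in rows:
        definition = row[3]
        if MISSPELLING.search(definition):
            continue
        match = FORM_OF.search(definition)
        if match and match.group(1) in misspelled:
            continue
        kept.append(row)
    return kept


if __name__ == "__main__":
    for row in filter_misspellings(parse_rows(SAMPLE_TSV)):
        print(row[1])
```

The first pass corresponds to the one-line grep; the second pass is the extra lemma lookup raised as an objection. Note that the sketch treats a title as misspelled if any of its senses is marked so, which is exactly the kind of edge case (a title that is a correct spelling in one sense and a misspelling in another) that would need more careful handling.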

Oppose

 * 1)    we should include any term that a user might look up. SemperBlotto (talk) 07:18, 7 April 2019 (UTC)
 * 2)  Per Wiktionary talk:Votes/2019-03/Excluding typos and scannos: I think the policy of excluding relatively rare misspellings and including common misspellings works reasonably well as is, and does not need to be changed by introduction of a distinction between a misspelling and a typo. It is reasonably easy to administer using Google Ngram Viewer and frequency ratios. The distinction of a misspelling from a typo is much more speculative than a quantitative frequency criterion; it is much harder to decide whether something is a typo or other kind of misspelling. The rationale for including common misspellings applies to common typos: someone is all too likely to look them up and the best user experience is provided by soft-redirecting the reader to the usual spelling rather than letting them try to figure that spelling out for themselves. The soft redirect still indicates the form to be a misspelling; for example, concieve says "Misspelling of conceive". I do not think the misspelling entries, which include typo entries, make us look unprofessional; we are honestly reporting what word forms are there to be observed, in the intellectually honest descriptivist spirit reminiscent of the spirit of empirical science. --Dan Polansky (talk) 07:43, 7 April 2019 (UTC)
 * 3)  I see no value in this change.--Prosfilaes (talk) 07:50, 8 April 2019 (UTC)
 * 4) I agree that including typos makes Wiktionary look messy at times, but I find it more important that users can find the word they are looking for. Therefore  provided that a) the typo occurs regularly and b) the correct form cannot be found unambiguously by using the 'search' function. Steinbach (talk) 17:25, 9 April 2019 (UTC)
 * [redacted oppose vote] for 2 reasons:
 * 1) English spelling is often unintuitive and I'm directed to the right page more than I care to admit. We have the chance to be more user-friendly than other dictionaries in this regard.
 * 2) A lot of the objections I see to "reduced quality" of the project (from pages that most users never see...) could be addressed if we just developed criteria for what constitutes a common misspelling. I agree that some of our misspellings are a total stretch, so let's just be more selective in which misspellings we allow. Dan Polansky outlines what this might look like at User talk:Dan Polansky/2013. Ultimateria (talk) 17:54, 13 April 2019 (UTC)
 * Forgive me if I'm wrong, but I think you misunderstand what's being proposed here. Admittedly, I should have been delineating things more clearly, and I certainly should have brought up some examples.
 * I'm not suggesting that we delete all misspelling entries. What I'm proposing is drawing a clear distinction (and consequently clearly different treatments) between 1) misspellings and 2) typos/scannos:
 * 1) frequently attested misspellings will remain admissible;
 * 2) typos and scannos won't be admissible, regardless of their frequency.
 * Currently, the sole criterion we're using (and very inconsistently so, I might add) is frequency; this means something like could be added as a "misspelling" of  if it were attested often enough (this is probably not the case, but theoretically it could be; unfortunately, I currently don't have any example of an obvious typo that is actually frequent enough to be admissible).
 * I see that as a mindless/mechanical/robotic approach that makes us introduce cruft into the dictionary and will ultimately make us look silly. That's what I'm objecting to with this vote.
 * Admittedly, certain cases won't be so clear cut ("is this a typo or a misspelling?"), but we can discuss those.
 * Now, about proper misspellings: the frequency criteria might have to be revisited (User:DTLHS had some which I think should be looked into), but I think this is a different question. Chignon – Пучок 18:25, 13 April 2019 (UTC)
 * Another way of distinguishing a typo from a misspelling would be the following: in the case of a misspelling, some (sometimes many) people think it's the correct spelling; in the case of a typo/scanno, everybody knows it's wrong. It's totally useless having an entry for something people know full well is wrong (because they're not going to look it up, no matter how many times it occurs); I argue it's even harmful from a credibility standpoint. Chignon – Пучок 18:36, 13 April 2019 (UTC)
 * @Chignon: Since we already exclude rare typos by excluding rare misspellings, you have to show that the additional cognitive and deliberative expense that the proposal introduces is worth it. I have not seen that explanation, nor have I seen anything remotely looking like a cost-benefit analysis. How are the readers going to benefit from your proposal and how are the editors going to benefit? --Dan Polansky (talk) 18:55, 13 April 2019 (UTC)
 * As for "harmful from a credibility standpoint", I don't understand this oft-repeated argument: we do mark misspellings as misspellings, so what's all that credibility business? --Dan Polansky (talk) 19:02, 13 April 2019 (UTC)
 * But we don't mark typos as typos (there is no template ). We mark them as misspellings . But typos are not misspellings; or if they are misspellings - as I often see you argue -, they're a fundamentally different type of misspellings than... you know, "regular" misspellings. And people expect us to be able to distinguish between the two; ergo, if we want to remain credible, we should not dump them into the same bag. What do you propose to do then? Chignon – Пучок 19:11, 13 April 2019 (UTC)
 * My oft-repeated position is that typos are misspellings, but not every misspelling is a typo (strict hyponymy between typo and misspelling). In any case, I would be ok marking typos specifically as typos, but I do not see why it should make any difference to the user. If I am a user, and I see something is a misspelling (typo or otherwise), I avoid it and that's the point of the marking. --Dan Polansky (talk) 19:16, 13 April 2019 (UTC)
 * Anyway, the definitions of typo and misspelling in various dictionaries seem to support the hyponymy hypothesis. What are your sources to support your non-hyponymy hypothesis? --Dan Polansky (talk) 19:19, 13 April 2019 (UTC)
 * I don't see the relation as hyponymic, unless you define "misspelling" very loosely, and create another subgroup next to the "typos" subgroup.
 * You're seeing things in purely pragmatic terms ("if misspelling, then avoid"), but I don't.
 * If someone ended up on an entry or  (that those are not frequent enough is beside the point) and were told it is a "misspelling" of, possible reactions (in my view) would be: "no that's not a misspelling, that's a typo" "duh, of course it's a misspelling, who needs to be told that?". If that person ended up on an entry such as , that would be different: "oh, I did not know that" or "ah yes, I'm always forgetting that one!". Chignon – Пучок 19:35, 13 April 2019 (UTC)
 * What are your sources? --Dan Polansky (talk) 19:41, 13 April 2019 (UTC)
 * ,, , . Hardly reference works, but you can see this is a common distinction. In any case, I don't think the "hyponymy" question matters. Do you agree there's a difference between an accidental/mechanical mistake (which wouldn't even occur in handwriting, by the way) and an intentional (but mistaken) spelling? Chignon – Пучок 19:45, 13 April 2019 (UTC)
 * : "The term includes errors due to mechanical failure or slips of the hand or finger, but excludes errors of ignorance, such as spelling errors [...]".
 * SPELLING AND TYPOGRAPHICAL MISTAKES DURING USERS’ ONLINE BEHAVIOR.
 * Spelling errors are introduced by either cognitive or typographical mistakes.
 * Spelling Error Trends and Patterns in Sindhi: "The basic type of classification of errors is between typing errors (commonly known as ‘typos’) and spelling errors. Typing errors occurs when the typist or the author knows the actual and correct spelling of the word but mistakenly or by slip of finger presses an invalid key [...]. Whereas the term spelling errors refers to the errors that occur due to the ignorance of the author or typist regardless of the fact that the actual spelling of the intended word is known or not. Usually the terms spelling errors and typing errors are considered to be same and referred to simply as a spelling errors, this sometimes also creates confusion among the users"
 * The terminology doesn't matter; what matters is that many people make a meaningful distinction between two types of mistakes, and I want us to do the same. Chignon – Пучок 19:55, 13 April 2019 (UTC)
 * (after edit conflict) Indeed, these are not reliable sources but thank you at least for these: they show what some other people think. From looking at, I am inclined to believe some people do use the terms typo and misspelling as exclusive categories. As for the distinction, I do see a difference between a typo and a non-typo misspelling, and from what I understand from the definitions available, both kinds are subsumed under the headword of misspelling, and therefore, our marking is accurate and does not detract from our credibility. But if it would make people happy to introduce a typo of template or the like, in addition to misspelling of, I have no objections. However, most typos are so rare that they fail the current criteria anyway.--Dan Polansky (talk) 19:58, 13 April 2019 (UTC)
 * As for "Spelling errors are introduced by either cognitive or typographical mistakes", that is not only consistent but also suggestive of the hyponymy hypothesis. --Dan Polansky (talk) 19:59, 13 April 2019 (UTC)
 * By the way, you have not voted yet, and you can vote even as the vote creator. --Dan Polansky (talk) 20:14, 13 April 2019 (UTC)
 * Done. Chignon – Пучок 08:06, 15 April 2019 (UTC)
 * Re: my vote. You're absolutely right, I was so focused on the discussion that I incorrectly assumed what the vote was about and barely read the proposal. I have retracted it for now. Ultimateria (talk) 20:07, 13 April 2019 (UTC)
 * 1)  I think there is a fine enough line between determining what is a typo and what is a misspelling (at least in some cases) that we shouldn't make that judgement. Of course, only common typos should be listed. Andrew Sheedy (talk) 02:23, 23 April 2019 (UTC)
 * How does one determine which typos are common and which are not? The word trail is (also) a typo for trial ; is it common? And assuming we have a way for finding out and it turns out to be a common typo, should we really list it, even though it is obviously just an unintentional slip of the finger? --Lambiam 07:37, 25 April 2019 (UTC)
 * In general, we can determine how common a typo is using the frequency ratio found in Google Ngram Viewer (GNV), like for determining whether any non-typo misspelling is common. For the case of trail and trial where overshadowing occurs, we may pass the burden of proof on to those who claim that trail is a common typo of trial; they cannot easily use GNV to support that claim, so they have to figure out another method, and if they fail to do so, we may remove the typo as one of which we have no evidence that it is common. --Dan Polansky (talk) 10:01, 28 April 2019 (UTC)
 * 1)  Firstly, there isn't any foolproof way to distinguish between typos and misspellings, and secondly, I don't see much of a reason not to include typos other than vague appeals about "including non-words" (they are words, just with accidental spelling)  and "improving user experience" (removing typos actually diminishes user experience as users might want to find typos for whatever reason). One example of a group that can benefit from typo coverage is people learning languages, who may lack enough knowledge to discern whether a form is a typo or a legitimate word.  --Hazarasp (talk · contributions) 06:22, 23 April 2019 (UTC)
 * 2)  I really don't see any use in removing content that people might actually look up. Maybe we don't need entries for every typo, but I think that unimportant ones should at least redirect to the entries of the correctly spelled words and be listed in an alternative forms section or something. I don't like the idea of removing all typos, especially because these entries might actually be useful to non-native speakers, as User:Hazarasp said. —Globins (yo) 22:20, 6 May 2019 (UTC)
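
The two metrics contrasted in this vote, the frequency ratio (how often a word is misspelled, relative to its standard spelling) and the absolute frequency (how often the misspelling occurs among all words), can be compared using the illustrative figures quoted in the Support discussion. This is a minimal sketch: the function names and the 1-in-1000 cutoff are hypothetical and are not part of WT:CFI or of any agreed practice.

```python
def absolute_frequency(word_share, misspelling_rate):
    """Share of all running words that are this particular misspelling."""
    return word_share * misspelling_rate


def is_common_by_ratio(misspelling_rate, cutoff=1e-3):
    """Frequency-ratio test; the cutoff value is an assumption for the example."""
    return misspelling_rate >= cutoff


# Figures taken from the discussion: a word making up about 5% of written
# English, misspelled one time in a million.
frequent_word = absolute_frequency(0.05, 1e-6)

# A word making up about 0.0004% of written English, misspelled one time
# in a hundred.
rare_word = absolute_frequency(0.000004, 1e-2)

if __name__ == "__main__":
    # By absolute frequency the first misspelling occurs more often...
    print(frequent_word > rare_word)
    # ...but by frequency ratio only the second counts as common.
    print(is_common_by_ratio(1e-6), is_common_by_ratio(1e-2))
```

With these figures the two metrics rank the misspellings in opposite orders, which is the crux of the disagreement: the ratio says something about whether a form is a misspelling at all, while the absolute figure says how often a reader is likely to meet it.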

Decision

 * 12-7-0; no consensus. I am glad to see we can still include OCR errors, imagine the cost to humanity if those had been omitted. - TheDaveRoss  17:29, 10 May 2019 (UTC)
 * An OCR error would first have to be attested (WT:ATTEST), and we verify attestation in the original scan as far as possible, or we should. I for one do not count a scanno toward attestation if the raster image shows it to be a scanno. If, however, someone still thinks scannos are poorly treated, a separate vote can be created to only cover scannos. --Dan Polansky (talk) 07:18, 31 May 2019 (UTC)
 * A dark day! Equinox ◑ 18:31, 10 May 2019 (UTC)