User talk:Robert Ullmann/Pronunciation exceptions

rules in effect
Entries are listed if they trigger any of the following rules, but not if AF would fix or partly fix the entry:


 *    found
 *  [[Rhymes:  found
 *  {|  found, table syntax
 * IPA found other than template name
 * enPR found other than template name
 * SAMPA found other than template name
 * AHD found
 * <tt>enPR</tt> after IPA
 * <tt>SAMPA</tt> before IPA
 * not at start, allowing * and : wikisyntax
 * template parameters checked for /'s and []'s
 * template parameters checked for " or " instead of multiple parameters
 * [more, this list is not quite current]

Looking only at Pronunciation sections of course.

More to be added in time, from list below and other observations. Especially doing regex match of the templates and checking the entire syntax.
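
As a rough illustration of how rules like those listed above could be applied to the text of a Pronunciation section, here is a minimal Python sketch. The rule labels echo the list; the regular expressions, the `RULES` table and the `check_section` function are all assumptions made for illustration, not the patterns the report actually uses.

```python
import re

# Hypothetical rule table: (label, compiled pattern). The patterns are
# illustrative guesses only, not the report's real rules.
RULES = [
    ("Rhymes link found", re.compile(r"\[\[Rhymes:")),
    ("table syntax found", re.compile(r"\{\|")),
    # "other than template name": the word appears, but not right after "{{"
    ("IPA found other than template name", re.compile(r"(?<!\{\{)\bIPA\b")),
    ("enPR found other than template name", re.compile(r"(?<!\{\{)\benPR\b")),
    ("SAMPA found other than template name", re.compile(r"(?<!\{\{)\bSAMPA\b")),
    ("AHD found", re.compile(r"\bAHD\b")),
]


def check_section(text):
    """Return the labels of all rules that match a Pronunciation section."""
    return [label for label, pattern in RULES if pattern.search(text)]
```

A section consisting only of `* {{IPA|/kæt/}}` triggers nothing here, while `* IPA: /kæt/` triggers the "IPA found other than template name" rule.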

original inspiration
from BP:

There are quite a lot of pronunciation sections out there that are not formatted according to current practice. To try and sort most of these out, could someone with the appropriate technical know-how set up something similar to User:Robert Ullmann/Mismatched wikisyntax that highlights entries with pronunciation sections that match any of the following criteria:
 * is completely empty (perhaps AF could just add to these?)
 * contains no templates
 * contains /.../ outside of a template
 * contains a <tt>
 * contains a link to the Rhymes: namespace outside of a template
 * contains a table (possibly search for instances of {| )
 * has a template that contains one or more slashes.
 * has an or  template with parameters that do not start and end with / (i.e. / should be the first and last character of all parameters)
 * contains an template anywhere except following a bullet, a double bullet, or a bullet and indent (i.e. not after one of *, ** , *:  )

And for pronunciation sections of non-English words
 * contains an template
 * contains and/or  templates without a |lang= parameter

Thanks, Thryduulf 13:15, 23 August 2008 (UTC)
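
Two of the criteria above (parameters wrapped in slashes, and templates only directly after a bullet) can be sketched as below. This is only a sketch under assumptions: it supposes the templates in question are `{{IPA|...}}` and `{{SAMPA|...}}`, that named parameters such as `lang=` should be skipped, and the function names are hypothetical.

```python
import re

# Non-greedy match, because SAMPA itself uses "{" as a symbol.
TEMPLATE = re.compile(r"\{\{(IPA|SAMPA)\|(.*?)\}\}")
# A template is "at the start" if only *, ** or *: precedes it.
LINE_START = re.compile(r"^(\*\*|\*:|\*)\s*\{\{")


def bad_parameters(line):
    """Return IPA/SAMPA parameters that do not start and end with '/'."""
    bad = []
    for match in TEMPLATE.finditer(line):
        for param in match.group(2).split("|"):
            if "=" in param:  # named parameter such as lang=fr; skipped
                continue
            if not (len(param) >= 2 and param.startswith("/") and param.endswith("/")):
                bad.append(param)
    return bad


def starts_after_bullet(line):
    """True if the first template directly follows *, ** or *: ."""
    return bool(LINE_START.match(line))
```

A real check would probably also have to allow an accent tag between the bullet and the template; this sketch deliberately does not.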

Initial observations

 * It might be caught by other rules, but if not "//" should be flagged up for attention.
 * Empty templates should also be flagged, or possibly just replaced by
 * The contents of templates should be ignored (if they aren't already)
 * [r] in IPA pronunciations of English words should be flagged (should be [ɹ]). For now it should be ignored in rhymes: although they ideally shouldn't be there, that can wait and be sorted as part of a comprehensive look at how we deal with rhymes.

Thryduulf 18:25, 24 August 2008 (UTC)

There are a number of other strings that could be checked for: <tt>(RP)</tt>, <tt>(US)</tt>, <tt> US </tt>, <tt>[WEAE]</tt>, etc, etc ... however we are already finding plenty!

One interesting observation is that almost all the exceptions are in the first 1/3 of the DB (very roughly of course); entries added in the last 18-24 months are in much better shape. It is older ones that haven't been re-visited that are showing up. Robert Ullmann 09:31, 25 August 2008 (UTC)

I've decided not to flag variants of US, UK, and RP at the start of lines, they can all be fixed by automation at some point. (AF rules?) Robert Ullmann 11:56, 25 August 2008 (UTC)


 * As an exceedingly minor point, could you put a space between the opening and closing square brackets in the "no // or []'s in IPA template" rule as "[]" looks like the rectangle used for a character that isn't supported by this font. Thryduulf 13:29, 25 August 2008 (UTC)


 * Done. Also added some whitespace after the (edit) link, makes the rule easier to distinguish and the edit link easier to hit. Robert Ullmann 09:26, 26 August 2008 (UTC)

automation
We should look for things that can be automated; for example this edit was done with regex match/replace.

Some may be a good idea, others not. Robert Ullmann 09:21, 25 August 2008 (UTC)
 * That's a good idea, there are several like that example. Templatising rhymes should be automatable as well, just replace <tt> Thryduulf 13:24, 25 August 2008 (UTC)
 * Also, sorting the order of the different pronunciation templates should be trivial as well. Thryduulf 13:25, 25 August 2008 (UTC)
 * Is removing slashes from the enPR template something that AF could do? Thryduulf 13:43, 26 August 2008 (UTC)


 * Yes, it is (removing slashes from enPR), we'll keep that in mind. I want to hit some other things first I think. Feel free to ignore things that look like they can be automated later, and work on the ones that are a mess? For example the ones with tables, which no automation will help with. (Well, it probably could be done, and if we had a few hundred thousand of them it would be worth the 3-10 days it would take to code ;-)


 * Note the example of an automated edit above was wrong, it lost the //'s in IPA and SAMPA! I am looking at doing a few things with AWB rules, where it is easy to check each edit; if I run into a large set that always works correctly I can transfer the rule to AF. Converting a lot of cases of US/UK/RP to {a} should be one of those. Robert Ullmann 15:14, 26 August 2008 (UTC)


 * Tables might not take that long, there seems to be at least one standard pattern. I've put one pattern into replace code, and done a few; I'll see how many I can knock out of the manual list. Robert Ullmann 15:17, 29 August 2008 (UTC)

I'm now seeing several in the format "* enPR, /IPA/, /SAMPA/", e.g. "* bû(r)d, /bɜː(r)d/, /b3:(r)d/". Could AF deal with these? Thryduulf 16:50, 12 September 2008 (UTC)
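
For what it's worth, the "* enPR, /IPA/, /SAMPA/" pattern looks regular enough for a single regex. The sketch below is not an AF rule, just an assumption of what one might look like: it supposes an English entry (no lang parameter), keeps the slashes inside IPA and SAMPA, and leaves anything that does not match exactly untouched. `PLAIN_TRIPLE` and `templatise` are made-up names.

```python
import re

# One bulleted line of the form "* enPR, /IPA/, /SAMPA/" and nothing else.
PLAIN_TRIPLE = re.compile(r"^\* ([^/,{}\[\]]+), (/[^/]+/), (/[^/]+/)$")


def templatise(line):
    """Wrap the three transcriptions in templates; return other lines unchanged."""
    m = PLAIN_TRIPLE.match(line)
    if m is None:
        return line
    enpr, ipa, sampa = m.groups()
    # Slashes stay inside IPA and SAMPA, but enPR takes none.
    return "* {{enPR|%s}}, {{IPA|%s}}, {{SAMPA|%s}}" % (enpr, ipa, sampa)
```

Applied to the example above, this gives `* {{enPR|bû(r)d}}, {{IPA|/bɜː(r)d/}}, {{SAMPA|/b3:(r)d/}}`.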


 * Could you also tweak the rhymes pattern matching to catch entries with a space after the : in the displayed text, e.g. Rhymes: -əʊk . Thryduulf 14:05, 19 September 2008 (UTC)


 * Notice there are two patterns:

Rhymes: -əʊk Rhymes: -əʊk


 * it has rules for both. Might not cover quite all the cases. The present run (as noted) misses a number because of a typo, they will go away on the next run. Robert Ullmann 17:21, 19 September 2008 (UTC)


 * Took out all the cases it is set up to fix now; from looking a bit the few remaining with 'Rhymes' have other details not quite right. Robert Ullmann 18:06, 19 September 2008 (UTC)

missing * before IPA
There are a very large number of entries that are missing the * before the IPA template, but are otherwise fine. They are almost always just the IPA pronunciation, nothing else in the section.

By large, I mean somewhere about 12 or 13 thousand.

Please ignore these for now; I'm going to teach AF to fix them, and then fix this report to not include exactly the set that AF will fix (i.e. if there is something else wrong, or it won't pattern-match in AF to be fixed automatically, it will still be reported). Robert Ullmann 09:24, 26 August 2008 (UTC)
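
A minimal sketch of the idea, assuming the section body is available as a plain string: fix only the exact case of a lone IPA template with no bullet, and return `None` for everything else so the report can keep listing it. The function name and the pattern are illustrative assumptions, not AF's actual code.

```python
import re

# A section whose only content is one IPA template with no bullet in front.
ONLY_IPA = re.compile(r"^\{\{IPA\|[^{}]*\}\}$")


def add_missing_bullet(section_body):
    """Return the fixed line, or None if the section is not the simple case."""
    lines = [line.strip() for line in section_body.splitlines() if line.strip()]
    if len(lines) == 1 and ONLY_IPA.match(lines[0]):
        return "* " + lines[0]
    return None  # anything else stays on the exception report
```

Sharing the one pattern between the fixer and the report is what keeps "the set AF will fix" and "the set the report skips" identical.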

Rhymes without pronunciation
This is a job for the future rather than now, but it would be good to get a list of entries which have a rhymes template but no other pronunciation information. This is probably going to be easiest to do by waiting until after the exceptions are (almost) all dealt with and checking for sections that contain without one or more of,  and. Thryduulf 00:26, 30 August 2008 (UTC)


 * Yes, that would be good. And I agree it would be easier when we have run down all the Rhymes links. Note that the ones AF will not fix for some reason will show on this list now. (for example, non-English, but I think those are all templated) Robert Ullmann 17:11, 30 August 2008 (UTC)


 * It wouldn't be hard to make a slightly less refined list that is all entries with Pronunciation sections that contain the string "[Rr]hymes" but do not contain any of ["IPA", "SAMPA", "enPR"], not paying attention to the syntax at all. Robert Ullmann 19:16, 11 September 2008 (UTC)
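
The less refined check described here amounts to a couple of substring tests. A small sketch, assuming the Pronunciation section text has already been extracted (the function name is hypothetical):

```python
import re

RHYMES = re.compile(r"[Rr]hymes")
TRANSCRIPTIONS = ("IPA", "SAMPA", "enPR")


def rhymes_only(section):
    """True if the section mentions rhymes but none of IPA, SAMPA or enPR."""
    has_rhymes = bool(RHYMES.search(section))
    has_transcription = any(name in section for name in TRANSCRIPTIONS)
    return has_rhymes and not has_transcription
```

Because it ignores syntax entirely, a stray "IPA" anywhere in the section (for instance in an audio label) will hide an entry from the list.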


 * Yes, that will work probably better. However I wouldn't worry about it yet, as we're not short of things to do! Thryduulf 19:49, 11 September 2008 (UTC)

Did some analysis of those entries that have rhymes, or rhymes plus audio, but not IPA/enPR/SAMPA. About 7000 of them at present. See User:Robert Ullmann/t19. I can run this again later when wanted. Robert Ullmann 01:27, 13 September 2008 (UTC)


 * Thanks for that - I'll have to think if there is some way to make that number more manageable to work on! It has thrown up an error that AF could usefully fix - links to rhymes with a double hyphen (e.g. --ɪd). Normally this is caused by the first character of the rhymes template being a hyphen, e.g. . If it finds this then AF should just remove the leading hyphen (i.e. convert to ). If it comes across any instances of an explicit (i.e. not templated) link to rhymes with a double hyphen then it should flag this for human attention. Thryduulf 02:21, 13 September 2008 (UTC)


 * Do you think that there are more than a handful? Probably just a few cases where someone put the - inside the template? May not be worth it. I can just have the exception report flag "hymes:--" and "hymes|-" ? Robert Ullmann 01:01, 14 September 2008 (UTC)
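
The two flags proposed here are plain substring tests. A sketch using exactly the two strings from the comment (the flag wording and the function name are made up for illustration):

```python
def double_hyphen_flags(section):
    """Return flags for the two double-hyphen patterns in a section."""
    flags = []
    if "hymes:--" in section:  # explicit link with a doubled hyphen
        flags.append("Rhymes link with double hyphen")
    if "hymes|-" in section:  # template whose first parameter starts with "-"
        flags.append("rhymes template parameter starts with hyphen")
    return flags
```

Dropping the leading letter, as in the comment, means the test catches both "Rhymes" and "rhymes".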


 * There will be a few I expect, as I've done it a few times that I've spotted and I wouldn't be surprised if there were some that I didn't spot. Wild off-the-top-of-my-head guess I'd expect to see something in the order of 20 instances, so flagging them will probably be fine. Have you kept a count anywhere of how many times each flag is matched? If not, don't bother adding it if it's more than a minute or so's work, it's just idle curiosity. Thryduulf 02:09, 14 September 2008 (UTC)


 * Will flag them now; I'm not going to try to make a table; they do all show in the /remains sub-page unless the entry was already tagged for something else. Robert Ullmann 13:51, 15 September 2008 (UTC)

Other useful reports of this nature would be:
 * enPR and/or SAMPA without IPA
 * Audio without IPA

Even though there is potential for crossover between the reports (i.e. audio and SAMPA only) I think they should be kept separate. This is because, for entries the person looking at the report doesn't know, the IPA could be added based on what is there. However, doing this from audio and doing it from SAMPA are very different tasks. Thryduulf 04:42, 8 November 2008 (UTC)

Tables
I've used a bit of semi-automation to pattern-match tables. Not terribly successful; it has taken 14 patterns to match less than 100 cases. Still a bit easier than doing them by hand; and more accurate. Mostly done with this, and I'll stop reloading the exception report as often. If you see a table pattern that it didn't fix that occurs more than once or twice, I can just add another. Robert Ullmann 17:11, 30 August 2008 (UTC)

11 September
I've added several more flags, and fixed a bug that kept a number of things from being found. Count has increased to 5K+

However, there are a number of rules that should be added to AF to fix problems


 * do Rhymes for Finnish
 * handle simple case of rhymes without "Rhymes: " in front
 * add a for various other tags, PRC, Taiwan, and some others
 * more RP/GenAM/US/UK cases

So don't spend time fixing those, I'll see what I can screen on the next run.

This will deal with a large part of the 5K. Robert Ullmann 19:07, 11 September 2008 (UTC)

hyphenation
Quite a few of the current exceptions are objecting to "* hyphenation..." , this can be avoided by converting them to "* ". Thryduulf 14:12, 12 September 2008 (UTC)


 * With some issues surrounding whether they use middot or some other character ... added to AF and will not show here. Some things are going to end up (or already are) using the hyphenation template with only one parameter, we can catch those a bit later. Robert Ullmann 13:53, 15 September 2008 (UTC)
 * I've seen a whole plethora of characters used in hyphenation templates - · ‧ — . &amp;middot; are the ones off the top of my head that I can remember. I've typically standardised them to · but I've not been at all rigorous about this. I didn't know the standard was multiple parameters (or even that it was an option) until very recently. Thryduulf 15:03, 15 September 2008 (UTC)
 * What is that other (bolder) middle dot? The one that isn't middot? I have it fixing middot, whether character or HTML entity. The point of making them multiple parameters is of course partly to solve this problem; it can then be changed or even made customizable. I should probably catch any use of the hyphen template that contains non-word characters other than | ... Robert Ullmann 18:54, 15 September 2008 (UTC)
 * The firefox identify characters extension reports the character "·" as "U+B7 MIDDLE DOT", which is what I get with alt gr + . on my kubuntu linux setup. The other one ("‧") is reported as "U+2027 HYPHENATION POINT", which is what is produced if you use the "Misc." section of the edit tools. Thryduulf 20:28, 15 September 2008 (UTC)
 * Ah ... I will fix 2027 as well as 00B7 ;-) Robert Ullmann 23:40, 15 September 2008 (UTC)
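
A sketch of what fixing 00B7, 2027 and the HTML entity could look like, turning each separator into a parameter break. It assumes the template is written `{{hyphenation|...}}` with no nested templates; the names are hypothetical and this is not AF's actual rule.

```python
import re

HYPHENATION = re.compile(r"\{\{hyphenation\|([^{}]*)\}\}")
# U+00B7 MIDDLE DOT, U+2027 HYPHENATION POINT, or the HTML entity.
SEPARATORS = re.compile("\u00b7|\u2027|&middot;")


def split_hyphenation(text):
    """Replace dot separators inside hyphenation templates with '|'."""
    def fix(match):
        return "{{hyphenation|" + SEPARATORS.sub("|", match.group(1)) + "}}"
    return HYPHENATION.sub(fix, text)
```

Dots outside a hyphenation template are left alone, since only the captured parameter text is rewritten.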

mismatched IPA and SAMPA
Given that there is a direct correspondence between IPA symbols and SAMPA symbols, would it be possible for a bot to add to this report instances where they do not match, if it were given a list of the correspondences to check for?

The rules I'm thinking would be (where "XPA" means "IPA" or "SAMPA"):
 * Look only at English pronunciation sections
 * Ignore any lines that have <1 IPA and/or <1 SAMPA template
 * compare only IPA and SAMPA templates that appear on the same line
 * If one or both of these are not in the format or , etc. then ignore this line and go to the next (hopefully these will be caught by the other rules).
 * If the IPA and SAMPA templates contain a different number of pronunciations (e.g., ) then flag as "mismatched number of pronunciations" (or similar wording), and move on to the next line.
 * If there is a correct correspondence between all the separate parameters of the templates, then everything is fine and move on to the next line (i.e. parameter 1 of the IPA template corresponds to parameter 1 of the SAMPA template)
 * If there is a mismatch, compare again ignoring all syllable break characters (full stops, ".") in both templates.
 * If this produces a match, then flag as "mismatched syllable breaks" (or similar warning), and continue to ignore the characters for all subsequent comparisons on this line.
 * If the IPA template contains "r", change it to "ɹ" and continue
 * If the IPA template contains "ɹ" and the SAMPA template contains "r" at the same position, change the latter to "r\" and continue
 * If the SAMPA template contains "y" and the IPA template has "j" in the same position, change the former to "j" and continue
 * If the IPA template has "g", change it to "ɡ" and continue
 * If the IPA or SAMPA contain any symbols not in the correspondence list, then make a note of the entry, section and character, and move on to the next line. (The note is so I or someone else can add that character to the list for a future run if appropriate.)

The correspondence list would be something like

Although, thinking about it, some of these tasks might be better for something other than a report generator, but it's gone 2am and I need to be up in the morning so I'll leave it for your and others' comments at this point! Thryduulf 01:09, 14 April 2010 (UTC)


 * As a very simple first step towards this, flagging entries where the IPA and SAMPA templates have different numbers of parameters would be useful. Thryduulf 16:35, 24 April 2010 (UTC)
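
The comparison steps in this section can be sketched roughly as follows. The correspondence table here is a deliberately tiny, hypothetical subset (a few standard IPA to SAMPA pairs), not the list proposed above, and the sketch only reports; it makes none of the suggested replacements. Plain "r" and "g" are left out of the table on purpose so that they are reported as unknown symbols.

```python
# Hypothetical subset of an IPA -> SAMPA correspondence table.
IPA_TO_SAMPA = {
    "ɹ": "r\\", "ɡ": "g", "ə": "@", "ɪ": "I", "ʊ": "U", "æ": "{",
    "ɜ": "3", "ː": ":", "ˈ": '"', "ˌ": "%", "ŋ": "N", "ʃ": "S",
    "θ": "T", "ð": "D",
}
# Characters assumed to be written the same way in both notations.
PLAIN = set("abdefhijklmnopstuvwz/.()")


def to_sampa(ipa):
    """Convert an IPA string; also return any characters not in the tables."""
    out, unknown = [], []
    for ch in ipa:
        if ch in IPA_TO_SAMPA:
            out.append(IPA_TO_SAMPA[ch])
        elif ch in PLAIN:
            out.append(ch)
        else:
            unknown.append(ch)
    return "".join(out), unknown


def compare(ipa_params, sampa_params):
    """Return a flag string for one line's IPA and SAMPA parameter lists."""
    if len(ipa_params) != len(sampa_params):
        return "mismatched number of pronunciations"
    for ipa, sampa in zip(ipa_params, sampa_params):
        converted, unknown = to_sampa(ipa)
        if unknown:
            return "unknown symbols: " + "".join(unknown)
        if converted == sampa:
            continue
        # Second pass: ignore syllable breaks in both templates.
        if converted.replace(".", "") == sampa.replace(".", ""):
            return "mismatched syllable breaks"
        return "mismatch"
    return "ok"
```

The first check alone, on the number of parameters, is the "very simple first step" mentioned above and needs no correspondence table at all.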

template:it-stress
See also WT:GP. This report (or maybe AF?) should flag any instance of on non-Italian entries. Thryduulf 13:31, 21 April 2010 (UTC)

Ignore contents of template:attention
Per its use at invidia, the exceptions report should ignore the contents of. Thryduulf 14:12, 23 April 2010 (UTC)

Hyphenation 2
Arghhh. User:Thryduulf has been using to add "dog-pronunciations" to entries. Can AF check the contents of the {hyphenation} templates, strip out the piping and compare against the pagename? These should always match for English (though not necessarily for Italian, Baltic, or Slavic languages). --EncycloPetey 16:01, 24 April 2010 (UTC)
 * I've not really been adding them. On the handful of occasions I've found such pronunciations, I've been shoehorning them into, more as something to do with them rather than with any great thought. If they are being checked as EP suggests, then the check should be case sensitive. Thryduulf 16:32, 24 April 2010 (UTC)
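
The check suggested above (strip the piping, compare against the pagename, case-sensitively) might look something like this sketch. It assumes the template is written `{{hyphenation|...}}`, that named parameters such as `lang=` should be skipped, and the function name is made up.

```python
import re

HYPHENATION = re.compile(r"\{\{hyphenation\|([^{}]*)\}\}")


def hyphenation_matches(pagename, text):
    """False if any hyphenation template does not rebuild the page name."""
    for match in HYPHENATION.finditer(text):
        parts = [p for p in match.group(1).split("|") if "=" not in p]
        if "".join(parts) != pagename:  # comparison is case-sensitive
            return False
    return True
```

For the entry axis, `{{hyphenation|ax|is}}` passes, while a respelling such as `{{hyphenation|ak|sis}}` would be flagged.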
 * No, it's worse than that. From other conversations I've had today, it's become clear that a number of editors are mis-using the hyphenation template to show syllabation, which is flat-out wrong. Syllabation refers to pronunciation of the spoken language, while hyphenation refers to typography of the written language. These are not the same thing, and can be demonstrated as different. Consider axis, which is hyphenated as "ax&middot;is" but syllabates as /ak.sis/. Because we've insisted on (incorrectly) keeping hyphenation in the pronunciation section, the template has been misused. We're going to have to look at, and manually evaluate, every use of Hyphenation across Wiktionary in some languages, it seems. --EncycloPetey 20:40, 24 April 2010 (UTC)
 * Perhaps we should have syllabation templated in the Pronunciation section, as it would seem a useful thing to have if people have been using hyphenation in this manner. Having the hyphenation in the pronunciation section has always seemed a bit odd to me, but I've not really thought too much about it or where it would be better placed. Thryduulf 21:50, 24 April 2010 (UTC)
 * We already do have syllabation templated in the Pronunciation section. That's what, , and  do.  It is not possible to indicate syllabation using the word's original spelling in any language that I've studied (although it might be possible in Sanskrit or Japanese).  Marking syllabation requires a phonetic respelling be used, because there are written characters in most languages that straddle two spoken syllables, and there are silent letters that cannot be assigned to any syllable.


 * Hyphenation would be better placed in a "Spelling" or "Orthography" section along with alternative forms/spellings. --EncycloPetey 00:45, 8 May 2010 (UTC)


 * I'm pretty sure you can always mark syllabation in Swahili (and a large class of similar languages) with the language spelling. At least I can't think of any counterexamples or language forms that would generate counterexamples. But in general, yes, in most language groups you cannot.


 * We don't have any good place for hyphenation, so it's just ended up in Pronunciation.


 * If you want to take a serious look, and it would be useful to extract (and perhaps classify) the uses, please tell me. Robert Ullmann 00:36, 17 May 2010 (UTC)

"IPA" in description of audio files
The string "IPA" sometimes appears in or as the label of pronunciation files, e.g. at u. This is being flagged by the "IPA not template" rule, but it isn't actually a problem as it isn't an IPA transcription. I think the best thing to do will be to just ignore the contents of templates completely (in the same way the contents of  is). We can come back to them later if there is a large problem with them (I'm not aware of one). Thryduulf 18:17, 7 May 2010 (UTC)
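
One way to ignore template contents is to blank out the whole template before any rule is run. In the sketch below the template names are assumptions: `attention` comes from the section heading further up, while `audio` is only a guess at the audio template meant here; `strip_ignored` is a hypothetical name.

```python
import re

# Templates whose contents the rules should never look at (assumed names).
IGNORED = re.compile(r"\{\{(?:audio|attention)\|[^{}]*\}\}")


def strip_ignored(section):
    """Remove ignored templates so their contents cannot trigger any rule."""
    return IGNORED.sub("", section)
```

With this applied first, an audio label such as "Audio (US), IPA" no longer reaches the "IPA not template" rule, while IPA templates on the same line are untouched.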