Wiktionary talk:About Marshallese

Pronunciation template overhaul
I'm well aware the entire Marshallese dictionary section on Wiktionary needs some serious updates, but maintenance has been difficult.

The pronunciation templates alone are a mess, and they are difficult to learn and remember. Since the writing system, even reformed, isn't one-to-one phonemic, I've been experimenting with designing a simpler phonemic code for pronunciation templates. I've made good progress with a working demo in JavaScript, but the problem is that I don't know enough Lua, the scripting language MediaWiki actually uses for templates. For example, instead of using messy pronunciation tokens like <tt>mh|hah|(h)|J|yeh|lh</tt>, the idea is to use a simpler code which scripts can automatically parse and convert to appropriate IPA presentation forms, so we could use <tt>mhahjelh</tt> instead. The syntax would be simple yet strict: any number of apostrophes could be used between letter combinations to disambiguate them, for example <tt>jal'woj</tt>. Dashes and spaces could also be inserted for clarity where appropriate.

The scripts (similar to the JS ones I've already written) would have a function to convert the code to an internal orthogonal format; this function would generate an error for malformed code. Other functions could convert the internal format into separate phonemic IPA and phonetic IPA. The phonetic IPA function automatically generates diphthong allophones, detects consonant cluster assimilations, and inserts epenthetic vowels in parentheses. It would drastically reduce the work editors have to do, so that pronunciation doesn't have to be so carefully micromanaged allophone-by-allophone. The code uses the following letters:
 * Labial obstruents: <tt>p b</tt>
 * Coronal obstruents: <tt>j t</tt>
 * Dorsal obstruents: <tt>k kw</tt>
 * Labial nasals: <tt>m mh</tt>
 * Coronal nasals: <tt>n nh nw</tt>
 * Dorsal nasals: <tt>ng ngw</tt>
 * Coronal trills: <tt>d r rw</tt>
 * Coronal laterals: <tt>l lh lw</tt>
 * Consonantal i: <tt>i</tt>
 * Special consideration is given only to asyllabic i, as this is the only vowel the dictionary specifically distinguishes when reduced to a glide, and the newer orthography gives it special consonant-like treatment in spelling.
 * Glides: <tt>y h w</tt>
 * Vowels: <tt>a e o u</tt>
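
A minimal sketch of the parsing step in JavaScript, assuming greedy longest-match tokenization (so <tt>lw</tt> wins over <tt>l</tt> + <tt>w</tt> unless an apostrophe intervenes). The names <tt>TOKENS</tt> and <tt>toInternal</tt> are hypothetical, and the internal symbols in the table are inferred from the conversion steps discussed in this thread, not copied from the actual scripts.

```javascript
// Hypothetical code-to-internal table. The right-hand symbols are an
// assumption inferred from the conversion steps, not the real script's data.
const TOKENS = {
  p: "pʲ", b: "pˠ",
  j: "tʲ", t: "tˠ",
  k: "kˠ", kw: "kʷ",
  m: "mʲ", mh: "mˠ",
  n: "nʲ", nh: "nˠ", nw: "nʷ",
  ng: "ŋˠ", ngw: "ŋʷ",
  d: "rʲ", r: "rˠ", rw: "rʷ",
  l: "lʲ", lh: "lˠ", lw: "lʷ",
  i: "\u012Dʲ", // consonantal i (i with breve, precomposed)
  y: "ɦʲ", h: "ɦˠ", w: "ɦʷ",
  a: "ɐ", e: "ɜ", o: "ɘ", u: "ɨ",
};

// Longest letters first, so the regex alternation prefers "ngw" over "ng" over "n".
const LETTERS = Object.keys(TOKENS).sort((a, b) => b.length - a.length);
// Sticky regex: group 1 = letter, group 2 = space or dash, group 3 = apostrophes.
const TOKEN_RE = new RegExp(`(${LETTERS.join("|")})|([ -])|('+)`, "y");

function toInternal(code) {
  let out = "";
  TOKEN_RE.lastIndex = 0;
  while (TOKEN_RE.lastIndex < code.length) {
    const at = TOKEN_RE.lastIndex;
    const m = TOKEN_RE.exec(code);
    if (m === null) {
      throw new SyntaxError(`malformed code at position ${at}: "${code}"`);
    }
    if (m[1] !== undefined) out += TOKENS[m[1]]; // letter or letter combination
    else if (m[2] !== undefined) out += m[2];     // spaces and dashes are kept
    // m[3]: apostrophes only separate letters and are dropped
  }
  return out;
}

console.log(toInternal("mhahjelh"));
console.log(toInternal("jal'woj"), toInternal("jalwoj"));
```

With this table, <tt>jal'woj</tt> and <tt>jalwoj</tt> tokenize differently, which is the whole point of the apostrophe.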

The internal format keeps the same spaces and dashes but no apostrophes (those are stripped out during parsing), and uses two-character consonants and one-character vowels. This makes it very easy to parse with regular expressions. The format is more or less IPA, but does not represent any final presentation form, as the two-character consonant format directly represents primary and secondary articulation. Conversion of this format to phonemic IPA is the most straightforward, as it only changes a few sequences. Conversion to Bender's format used in the Marshallese-English Dictionary's pronunciation guide is not complicated either. But since the IPA has never traditionally been friendly to vertical vowel systems because of the large amount of phonemic underspecification involved, conversion to phonetic IPA takes a more complicated algorithm, though I still wrote one that works as expected.
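
To illustrate why fixed-width segments are easy to handle with regular expressions, here is a rough JavaScript sketch that splits an internal string into segments and rejects anything that doesn't fit. The symbol inventory in <tt>SEGMENT_RE</tt> is an assumption pieced together from the conversion steps in this thread, and <tt>segments</tt> is a hypothetical helper name.

```javascript
// One segment is: a consonant or glide base plus a secondary-articulation
// mark, OR a single vowel, OR a space/dash separator.
// The inventory is an assumption inferred from the conversion steps.
const SEGMENT_RE = /[ptkmnŋrl\u012Dɦ][ʲˠʷ]|[ɐɜɘɨ]|[ -]/g;

function segments(internal) {
  const parts = internal.match(SEGMENT_RE) ?? [];
  // If joining the matches does not reproduce the input, some character
  // was skipped, so the string is not valid internal format.
  if (parts.join("") !== internal) {
    throw new SyntaxError(`not valid internal format: "${internal}"`);
  }
  return parts;
}

console.log(segments("mˠɐɦˠtʲɜlˠ"));
```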

Normally, all words should begin and end with consonant phonemes. Where they don't, the script and templates could infer that they are affixes, like <tt>ru-</tt> or <tt>-un</tt>, whose dangling vowels take on three different vowel allophones depending on the consonants they are fused to. This would necessitate three different phonetic pronunciations for these affixes. There has got to be a better way of handling a pronunciation section than what is currently being used for such affixes, which is where I realized just how inadequate the current pronunciation templates are. Four-vowel mode represents Bender's vowels, and three-vowel mode represents Choi's vowels. For Willson's vowels, a function can take the result of the four-vowel conversion and replace each of the vowel symbols <tt>æɛeɒɔo</tt> with one of <tt>ɛeɪɔoʊ</tt> respectively. But whether there is one pronunciation or three (with internal sequences <tt>ɦʲ ɦˠ ɦʷ</tt> attached to dangling vowels before conversion), each is fed through the following series of conversions from the internal format:
 * 1) Strip out dashes and spaces.
 * 2) Progressive assimilation of labialized secondary articulation of certain consonant clusters.  For each four-character-long regular expression (regex) match of <tt> /[kŋ]ʷ[kŋ]ˠ|[nrl]ʷ[nrl][ʲˠ]|nʷtʲ/ </tt>, the fourth character is replaced with a labialization symbol (<tt>ʷ</tt>).
 * 3) Since there is no <tt>tʷ</tt> in Marshallese, each regex sequence of <tt> /tʷ/ </tt> is replaced with <tt>tˠ</tt>.
 * 4) Regressive assimilation of palatalized and velarized secondary articulation of certain consonant clusters.  For each four-character-long regex sequence of <tt> /(?:[pm][ʲˠ]){2}|[tn][ʲˠʷ]t[ʲˠ]|(?:[kŋ][ˠʷ]){2}|(?:[nrl][ʲˠʷ]){2}/ </tt>, the second character is replaced with the fourth character.
 * 5) Regressive assimilation of primary articulation of certain consonant clusters.  For each three-character-long regex sequence of <tt> /p[ʲˠ]m|[rl][ʲˠʷ]n|k[ˠʷ]ŋ/ </tt>, the first character is replaced with the third character.
 * 6) Optionally (based on an argument to the function), the four vowel phonemes may be reduced to three, as has become commonplace in modern speech per Choi.  If so, then each regex sequence of <tt> /[ɜɘ]/ </tt> is replaced with <tt>ə</tt>.
 * 7) Insertion of epenthetic vowels within consonant clusters.  For each six-character-long regex sequence of <tt> /[ɐɜəɘɨ](?:r.t.|l.tʲ|[ptkmŋ].[nrl].|[pm].[tkŋ].|[tnrl].[pkmŋ].|[kŋ].[ptm].)[ɐɜəɘɨ]/ </tt> (evaluated twice because instances can overlap), an epenthetic vowel in parentheses is inserted between the third and fourth character.  The epenthetic vowel's height is irrelevant, but it is commonly transitional between the vowels represented by the first and sixth characters.
 * 8) Conversion of vowels to their diphthong allophones.  For each five-capture-long regex sequence of <tt> /([ʲˠʷ])(\(?)([ɐɜəɘɨ])(\)?.)([ʲˠʷ])/ </tt> (evaluated twice), the vowel represented by the third capture is replaced by a tied diphthong influenced by the secondary articulations represented by the first and fifth captures, and the complete string represented by the captures is otherwise returned with the replaced third capture.  The vowel replacement format I use is <tt>V_V</tt>, with two vowel characters separated by an underscore&mdash;this will be replaced with a proper tie diacritic in the final presentation form.  In three-vowel mode, each of the vowels <tt>ɐəɨ</tt> results in the vowel allophones <tt>ɛei</tt> (palatalized), <tt>ɑʌɯ</tt> (velarized) and <tt>ɔou</tt> (labialized).  In four-vowel mode, each of <tt>ɐɜɘɨ</tt> becomes one of <tt>æɛei</tt>, <tt>ɑʌɤɯ</tt> or <tt>ɒɔou</tt>.
 * 9) Conversion of tied monophthong allophones to simple monophthongs.  For each three-character regex sequence of <tt> /[æɛeiɑʌɤɯɒɔou]_[æɛeiɑʌɤɯɒɔou]/ </tt>, if the first and third characters are the same, only that character is returned, otherwise the original regex match is returned.
 * 10) Eliminate clusters of glide consonants.  Each regex sequence of <tt> /ɦ.ɦ./ </tt> is replaced with a hard syllable break (<tt>.</tt>) to indicate the presence of one more mora than there would otherwise be.
 * 11) Eliminate glide consonants from clusters where the first consonant is not a glide.  For each three-character regex sequence of <tt> /[ʲˠʷ]ɦ./ </tt>, the second and third characters are replaced with a hard syllable break.
 * 12) Eliminate glide consonants from clusters where the second consonant is not a glide.  For each four-character regex sequence of <tt> /ɦ..[ʲˠʷ]/ </tt>, the first and second characters are replaced with a hard syllable break.
 * 13) Delete remaining glide consonants.  Each regex sequence of <tt> /ɦ./ </tt> is removed with no replacement.
 * 14) Represent double consonants as geminates.  For each four-character regex sequence of <tt> /(?:[ptkmnŋrlĭ].){2}/ </tt>, if the substring represented by the first and second characters is the same as the substring represented by the third and fourth characters, then the third and fourth characters are replaced with a single gemination symbol (<tt>ː</tt>).
 * 15) Replace certain obstruent consonants with partially voiced allophones.  For each three-capture regex sequence of <tt> /([mnŋĭ].|[æɛeiɑʌɤɯɒɔou]\)?|\.)([ptk].)(\(?[æɛeiɑʌɤɯɒɔou\.])/ </tt> (evaluated twice), the second capture is replaced by the appropriate allophone sequence, and returned sandwiched between the first and third captures.  Each second capture match <tt>pʲ pˠ tʲ tˠ kˠ kʷ</tt> becomes one of <tt>bʲ bˠ zʲ dˠ ɡˠ ɡʷ</tt> respectively.  This step could be considered unnecessary, as all styles of obstruent pronunciation can occur in free variation.
 * 16) Represent double vowels as geminated.  For each regex sequence of <tt> /ææ|ɛɛ|ee|ii|ɑɑ|ʌʌ|ɤɤ|ɯɯ|ɒɒ|ɔɔ|oo|uu/ </tt>, the second character is replaced with a gemination symbol.
 * 17) Replace certain sequences with final presentation forms.  For each regex sequence of <tt> /[kɡnŋl][ˠʷ]|[rĭ]ʲ|[bdz_]/ </tt>, each match <tt>b z d kˠ ɡˠ kʷ ɡʷ nˠ nʷ ŋˠ ŋʷ lˠ lʷ rʲ ĭʲ</tt> becomes one of <tt>b̥ z̥ d̥ k̠ ɡ̠̊ k̠ʷ ɡ̠̊ʷ ɳˠ ɳʷ ŋ̠ ŋ̠ʷ ɫ ɫʷ r̪ʲ i̯</tt> respectively, and each underscore character (<tt>_</tt>) becomes an IPA tie bar.  Some of these conversions may be considered unnecessary, but they fit with the general trend of highly detailed pronunciation templates on Wiktionary for languages like Spanish, Japanese, etc.
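
As a rough illustration, steps 1 through 6 and the Willson vowel remapping can be written as plain regex substitutions in JavaScript. This is only a sketch under the assumption that each step is a single global, non-overlapping replace; <tt>assimilate</tt>, <tt>toWillson</tt> and the <tt>threeVowels</tt> flag are hypothetical names, and the later steps (epenthesis, diphthongs, glide deletion, presentation forms) are left out.

```javascript
// Steps 1-6 of the list above, applied to an internal-format string.
function assimilate(internal, threeVowels = false) {
  // Step 1: strip dashes and spaces.
  let s = internal.replace(/[- ]/g, "");
  // Step 2: progressive labialization; the fourth character becomes ʷ.
  s = s.replace(/[kŋ]ʷ[kŋ]ˠ|[nrl]ʷ[nrl][ʲˠ]|nʷtʲ/g,
    (m) => m.slice(0, 3) + "ʷ");
  // Step 3: tʷ does not exist, so it falls back to tˠ.
  s = s.replace(/tʷ/g, "tˠ");
  // Step 4: regressive assimilation of secondary articulation;
  // the second character is replaced with the fourth.
  s = s.replace(
    /(?:[pm][ʲˠ]){2}|[tn][ʲˠʷ]t[ʲˠ]|(?:[kŋ][ˠʷ]){2}|(?:[nrl][ʲˠʷ]){2}/g,
    (m) => m[0] + m[3] + m.slice(2));
  // Step 5: regressive assimilation of primary articulation;
  // the first character is replaced with the third.
  s = s.replace(/p[ʲˠ]m|[rl][ʲˠʷ]n|k[ˠʷ]ŋ/g, (m) => m[2] + m.slice(1));
  // Step 6 (optional): merge the two mid vowels for three-vowel mode.
  if (threeVowels) s = s.replace(/[ɜɘ]/g, "ə");
  return s;
}

// Willson's vowels: remap the output of the four-vowel phonetic conversion.
// A single pass with a lookup table avoids chained replacements (æ→ɛ→e).
const WILLSON = { "æ": "ɛ", "ɛ": "e", "e": "ɪ", "ɒ": "ɔ", "ɔ": "o", "o": "ʊ" };

function toWillson(phonetic) {
  return phonetic.replace(/[æɛeɒɔo]/g, (v) => WILLSON[v]);
}

console.log(assimilate("ɦˠɐnʷ-tʲɐkˠ"));
console.log(assimilate("tʲɐlʲnˠɜ", true));
console.log(toWillson("æɛeɒɔo"));
```

In the first call, <tt>nʷtʲ</tt> passes through steps 2, 3 and 4 in turn and ends up as <tt>nˠtˠ</tt>; in the second, <tt>lʲnˠ</tt> becomes <tt>nˠnˠ</tt> via steps 4 and 5.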

I'm a decent programmer, but I may need help with the Lua required, as I'm not very used to its syntax or data types or the style of its array indexing. - Gilgamesh~enwiki (talk) 03:59, 20 October 2019 (UTC)

I never expected to pick up a new programming language this rapidly, this usefully. But so far I've done very well writing Module:mh-pronunc. Some invocation examples are listed below. Now that this is in good working order, there are still test case scripts, which I don't yet have a good idea how to write, and documentation, which will probably be easier to rough out, followed by transitional deployment to Template:mh-ipa-rows, etc. My thinking is that, to minimize disruption to articles as the template syntax is updated, the old-fashioned token-based templates will display if there are two or more arguments, but a new script-driven template will display if there is only one argument. - Gilgamesh~enwiki (talk) 14:00, 21 October 2019 (UTC)
 * (<tt> </tt>)
 * (<tt> </tt>)
 * (<tt> </tt>)
 * (<tt> </tt>)
 * (<tt> </tt>)
 * (<tt> </tt>)
 * (<tt> </tt>)
 * (<tt> </tt>)
 * (<tt> </tt>)
 * (<tt> </tt>)
 * (<tt> </tt>)
 * (<tt> </tt>)
 * (<tt> </tt>)
 * (<tt> </tt>)
 * (<tt> </tt>)
 * (<tt> </tt>)
 * (<tt> </tt>)
 * (<tt> </tt>)
 * (<tt> </tt>)

Trying something else, for dangling vowels only meant to be attached to certain consonants. Some affixes are used only without a hyphen before labialized consonants. With this addition to the code syntax, the pronunciation can be tailored to reflect only that vowel contour in the phonetic transcription without adding a glide phoneme to the phonemic transcription. For this, I added pseudo-glide sequences <tt>c ch cw</tt> for palatalized, velarized and labialized affixes respectively. If the code ends with a dangling vowel, the script will show all three possible vowel contours. - Gilgamesh~enwiki (talk) 06:03, 22 October 2019 (UTC)
 * (<tt> </tt>)
 * (<tt> </tt>)

I realized only after the fact just how counterintuitive the <tt>c ch cw</tt> syntax is. I chose "c" to represent a consonant on the other side that isn't there... Yeah, a little oblique considering how it actually renders. I changed the syntax to <tt>h_ w_ y_</tt> or <tt>_h _w _y</tt>: basically any one of the letters <tt>h w y</tt> with one or more underscores on either side. Now that is much more readable. - Gilgamesh~enwiki (talk) 06:21, 22 October 2019 (UTC)
 * (<tt> </tt>)
 * (<tt> </tt>)
 * (<tt> </tt>)
 * (<tt> </tt>)
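
A small JavaScript sketch of how the underscore pseudo-glides might be recognized. It assumes that <tt>y h w</tt> stand for the palatalized, velarized and labialized contours respectively (matching the glide letters), and that underscores on one side alone are enough; <tt>pseudoGlides</tt> and the sample code <tt>ru_w</tt> are hypothetical.

```javascript
// Assumed mapping from pseudo-glide letter to the vowel contour it selects.
const CONTOUR = { y: "palatalized", h: "velarized", w: "labialized" };

// One of h/w/y with at least one underscore before it, after it, or both.
const PSEUDO_GLIDE_RE = /_+([hwy])_*|([hwy])_+/g;

function pseudoGlides(code) {
  return Array.from(code.matchAll(PSEUDO_GLIDE_RE), (m) => {
    const letter = m[1] ?? m[2]; // whichever alternative matched
    return { text: m[0], contour: CONTOUR[letter] };
  });
}

console.log(pseudoGlides("ru_w"));
console.log(pseudoGlides("h__un"));
```

A plain <tt>w</tt> with no underscore next to it is not matched, so ordinary glides and combinations like <tt>rw</tt> are left alone.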

Trying something new with comma syntax. And it works! It is now possible to enter more than one code into the templates, separated by commas. The scripts will convert all of them and weed out duplicate results. This helps cut down on redundant repetition in Choi's three-vowel mode. - Gilgamesh~enwiki (talk) 13:38, 23 October 2019 (UTC)
 * (<tt> </tt>)
 * (<tt> </tt>)
 * (<tt> </tt>)
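
The comma handling can be sketched in a few lines of JavaScript. This assumes a converter function is passed in; <tt>convertAll</tt> is a hypothetical name, and the stand-in converter below only imitates the vowel merger of three-vowel mode so the de-duplication is visible.

```javascript
// Split a comma-separated list of codes, convert each one, and drop
// duplicate results while keeping first-seen order (Set preserves it).
function convertAll(codes, convert) {
  const results = codes.split(",").map((code) => convert(code.trim()));
  return [...new Set(results)];
}

// Stand-in converter for illustration only: merges the two mid vowels,
// as three-vowel mode would, without doing any real IPA conversion.
const fakeThreeVowel = (code) => code.replace(/[eo]/g, "ə");

console.log(convertAll("mej, moj, maj", fakeThreeVowel));
```

Here the first two codes collapse to the same result, so only two results would be displayed.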

Trying something new. - Gilgamesh~enwiki (talk) 20:24, 27 October 2019 (UTC)
 * (<tt> </tt>)
 * (<tt> </tt>)