Module:User:Benwing2/de-pron

--[=[

Implementation of pronunciation-generation module from spelling for German.

Author: Benwing

STRESS

If there are no prefixes or suffixes, the default position of the stress is on the first syllable; e.g. arbeiten /ˈaʁbaɪ̯tən/ "to work" or hundert /ˈhʊndɐt/ "hundert". This may change in the presence of prefixes or suffixes. For example, the module recognizes the prefix er- as unstressed, and hence in a word like Ermächtigung /ɛʁˈmɛçtɪɡʊŋ/ "authorization" the stress is instead placed on the first syllable of the main part. Similarly, the module recognizes the suffix -ieren as stressed, and hence abandonnieren /abandɔnˈiːʁən/ "to abandon". The module also knows about stressed prefixes such as auf-, aus-, über-, etc. and unstressed suffixes as -chen and -lich. Although they don't change the position of the stress, they have other effects. For example, in aufstellen /ˈaʊ̯fˌʃtɛlən/ "to set up" the main part following the stressed prefix gets secondary stress and is treated as word-initial (hence 'st-' is rendered as /ʃt/). Similarly, in Mädchen /ˈmɛːtçən/ "girl" the main part is treated as if word-final and hence the stressed vowel 'ä' is lengthened before a single word-final consonant. In the presence of both a stressed prefix and suffix, the prefix takes precedence and the suffix gets secondary stress, hence ausprobieren /ˈaʊ̯spʁoˌbiːʁən/ "to try out".

Use an acute accent on a vowel to override the position of primary stress (in a diphthong, put it over the first vowel): á é í ó ú ä́ ö́ ǘ ái éi áu ä́u éu. Examples: systemisch 'systémisch' /zʏsˈteːmɪʃ/ "systemic", Migräne 'Migrä́ne' /miˈɡʁɛːnə/ "migraine". Use a grave accent to add secondary stress: à è ì ò ù ä̀ ö̀ ǜ ài èi àu ä̀u èu. Examples: Prognose 'Prògnóse' /ˌpʁoˈɡnoːzə/ "forecast", Milligramm 'Milligràmm' /ˈmɪliˌɡʁam/ "milligram" (the primary stress takes its default position on the first syllable, as it is unmarked).

COMPOUNDS --

Use a hyphen (-) to indicate a word/word boundary in a compound word. The result will be displayed as a single word but the consonants on either side treated as if they occurred at the beginning/end of the word, and each part of the compound gets its own stress (normally primary stress on the first part and secondary stress on the remaining parts). Examples: Hubschrauber 'Hub-schrauber' /ˈhuːpˌʃʁaʊ̯bɐ/ "helicopter", Landeplatz 'Lande-platz' /ˈlandəˌplat͡s/ "landing place, landing pad".

Use a double hyphen (--) to indicate a word/word boundary in a compound word where one or both of the component words itself are themselves compounds. Here, the original secondary stresses turn into tertiary stresses (which aren't shown but are treated internally as stressed) and the primary stress in the second and further components becomes secondary. An example is Hubschrauberlandeplatz "helipad" (literally "helicopter landing pad") respelled 'Hub-schrauber--lande-platz' and rendered [ˈhuːpʃʁaʊ̯bɐˌlandəplat͡s]. Comparing this with the above renderings of Hubschrauber and Landeplatz, it can be seen that the original primary stress in Landeplatz becomes secondary while the original secondary stresses disappear. The loss of secondary stress has no other effect on the phonology. For example, consider Rundflug 'Rund-flug' /ˈʁʊntˌfluːk/ "sightseeing flight" (literally "circular flight"). When combined with Hubschrauber, the result is Hubschrauberrundflug 'Hub-schrauber--rund-flug' /ˈhuːpʃʁaʊ̯bɐˌʁʊntfluːk/ "helicopter sightseeing flight", where the last vowel stays long despite (apparently) losing the stress.

Other examples to demonstrate the use of single and double hyphens in compounds: (literally "gastric mucosal inflammation") 'ˈkʁaftfaːʁt͡sɔɪ̯kˌhaftp͡flɪçtfɛʁzɪçəʁʊŋ' "motor vehicle liability insurance" /ˈaɪ̯ɐʃaːlənˌzɔlbʁʊxʃtɛlənfɛʁˌʔuːʁzaxɐ/ "egg cracker, eggshell breaker" (literally "eggshell breaking point causer")
 * Maulwurfshügel 'Maul-wurfs--hügel' /maʊ̯lvʊʁfsˌhyːɡəl/ "molehill"
 * Hubschrauberabsturz 'Hub-schrauber--absturz' /ˈhuːpʃʁaʊ̯bɐˌʔapʃtʊʁt͡s/ "helicopter crash"
 * Aufenthaltsgenehmigung 'Aufenthalts-genehmigung' /ˈaʊ̯f(ʔ)ɛnthalt͡sɡəˌneːmɪɡʊŋ/ "residence permit"
 * Magenschleimhautentzündung 'Magen-schleim-haut--entzündung' /ˈmaːɡənʃlaɪ̯mhaʊ̯t(ʔ)ɛntˌt͡sʏndʊŋ/ "gastritis"
 * Kraftfahrzeug-Haftpflichtversicherung 'Kraft-fahr-zeug--Haft-pflicht-versicherung'
 * Eierschalensollbruchstellenverursacher 'Eier-schalen--soll-bruch-stellen--verursacher'

Explicitly marked stresses using acute and grave accents are *relative* stresses, i.e. they are relative to the component they are within. To make this clear, consider Pilot 'Pilót' /piˈloːt/ "pilot". When combined with Hubschrauber, the result is Hubschrauberpilot 'Hub-schrauber--pilót' /ˈhupʃʁaʊ̯bɐpiˌloːt/ "helicopter pilot". Here, the rule of compound stressing specifies that the first component gets primary stress while remaining components get secondary stress. The primary stress specified here by the acute accent on Pilot is *relative* to the overall component stress, and hence it is displayed as secondary stress.

Other examples with explicitly marked stress in compounds: "(female) prime minister candidate"
 * Ministerpräsidentenkandidatin 'Miníster-präsidénten-kandidátin' /miˈnɪstɐpʁɛziˌdɛntənkandiˌdaːtɪn/
 * Eisbearbeitungsmaschine 'Eis-bearbeitungs--maschíne' /ˈaɪ̯sbəʔaʁbaɪ̯tʊŋsmaˌʃiːnə/ "ice resurfacing machine"
 * Hochtemperaturreaktor 'Hohch-temperatúr-reáktor' /ˈhoːxtɛmpəʁaˌtuːʁʁeˌaktoːʁ/ "high temperature reactor"
 * Arbeitsbeschaffungsprogramm 'Arbeits-beschaffungs--prográmm' /ˈaʁbaɪ̯t͡sbəʃafʊŋspʁoˌɡʁam/ "job creation program"

"helicopter pilot", compounded of Hubschrauber and Pilot "pilot". The latter must be respelled 'Pilót' due to the unexpected stress. The combination Hubschrauberpilot should correspondingly be respelled 'Hub-schrauber--pilót', rendered as  [ˈhupʃʁaʊ̯bɐpiˌloːt]. Here, the primary stress should be converted to secondary; otherwise the word would wrongly end up rendered as [ˌhupʃʁaʊ̯bɐpiˈloːt], where the module assumes that since the second word has the primary stress, other words (including the first one) should have secondary stress. Other similar examples: Hubschrauberabsturz "helicopter crash" 'Hub-schrauber--absturz' /ˈhuːpʃʁaʊ̯bɐˌʔapʃtʊʁt͡s/ (the stressed  prefix ab- is automatically recognized) and Maulwurfshügel "molehill" /ˈmaʊ̯lvʊʁfsˌhyːɡəl/. Occasionally the double hyphen should be used even when no single hyphens occur. Another example of note is Aufenthaltsgenehmigung "residence permit". Aufenthalt by itself (without respelling) is rendered /ˈaʊ̯f(ʔ)ɛntˌhalt/, i.e. it has secondary stress. The respelling 'Aufenthalts-genehmigung' is rendered /ˈaʊ̯f(ʔ)ɛnthalt͡sɡəˌneːmɪɡʊŋ/ without the original secondary stress; the module recognizes the need to remove the secondary stress here. Essentially, if two words are joined by an ordinary hyphen and one of them has secondary stress due to a stressed prefix, the hyphen is implicitly "upgraded" to a double hyphen.

Things may get more complex when stresses "bump up" against each other. For example, Abhängigkeit "dependency" with no respelling is rendered [ˈapˌhɛŋɪçkaɪ̯t]. In combination as e.g. Drogenabhängigkeit "drug dependency,  drug addiction" respelled 'Drogen-abhängigkeit', the result is /ˈdʁoːɡənˌʔaphɛŋɪçkaɪ̯t/, which appears correct. In combination as e.g. Alkoholabhängigkeit respelled 'Alkohól-abhängigkeit', the result is  /alkoˈhoːlˌʔaphɛŋɪçkaɪ̯t/, which may sound wrong to native ears because of the two stresses bumping up against each other. The actual pronunciation may be more like /alkoˈhoːl(ʔ)apˌhɛŋɪçkaɪ̯t/ with the secondary stress moving (respell 'Alkohól-abhä̀ngigkeit') or /alkoˈhoːlʔaphɛŋɪçkaɪ̯t/ with the secondary stress disappearing (respell  'Alkohól-ȧbhängigkeit' where the dot-above suppresses the stress marker but otherwise doesn't change the   phonology).

2. To specify absolute stress c. NOTE: "Relative" stress refers to how stress is handled specifies the stress on a particular com Double grave accent to add tertiary stress: ȁ ȅ ȉ ȍ ȕ ä̏ ö̏ ü̏ ȁi ȅi ȁu ä̏u ȅu. Tertiary stress has the same effect on vowels as secondary stress (e.g. they lengthen in open syllables) but is rendered without a stress mark. Under normal circumstances, you do not have to explicitly add tertiary stress. Rather, secondary stresses (including those generated automatically) are automatically converted to tertiary stress in certain circumstances, e.g.  when compounding two words that are already compounds. See the discussion on -- (double hyphen) below. 'h' or ː after a vowel to force it to be long. Circumflex on a vowel (â ê î ô û ä̂ ö̂ ü̂) to force it to have closed quality. Breve on a vowel, including a stressed vowel (ă ĕ ĭ ŏ ŭ ä̆ ö̆ ü̆) to force it to have open quality. Tilde on a vowel or capital N afterwards to indicate nasalization. For an unstressed 'e', force its quality using schwa (ə) to indicate a schwa, breve (ĕ) to indicate open quality /ɛ/, circumflex (ê) to indicate closed quality /e/. . (period) to force a syllable boundary. - (hyphen) to indicate a word/word boundary in a compound word; the result will be displayed as a single word but the consonants on either side treated as if they occurred at the beginning/end of the word, and each part of the compound gets its own stress (primary stress on the first part unless another part has primary stress, secondary stress on the remaining parts unless stress is explicitly included). -- (double hyphen) to indicate a word/word boundary in a compound word where one (or both) of the component words itself is a compound. Here, the original secondary stresses turn into tertiary stresses (not shown) and the primary stress the second component (and further components) becomes secondary. See discussion below. < (less-than) to indicate a prefix/word or prefix/prefix boundary; similar to - for word/word boundary, but the prefix before the < sign will be unstressed. > (greater-than) to indicate a word/suffix or suffix/suffix boundary; similar to - for word/word boundary, but the suffix after the > sign will be unstressed. + (plus) is the opposite of -; it forces a prefix/word, word/word or word/suffix boundary to *NOT* occur when it otherwise would. _ (underscore) to force the letters on either side to be interpreted independently, when the combination of the two would normally have a special meaning. ̣ (dot under) on any vowel in a word or component to prevent it from getting any stress. ̯ (inverted breve under) to indicate a non-syllabic vowel. Most common uses: i̯ in words like Familie respelled 'Famíli̯e'; o̯ in French-derived words like soigniert respelled 'so̯anjiert' (-iert automatically gets primary stress); occasionally y̯ in words like Ichthyologie respelled 'Ichthy̯ologie' (-ie automatically gets  primary stress). There is also u̯ but it's mostly unnecessary as a 'u' directly followed by another vowel by default becomes non-syllabic. Finally, the generated phonemic notation includes /aɪ̯/ for spelled 'ei' and 'ai'; /ɔɪ̯/ for spelled 'eu' and 'äu'; and /aʊ̯/ for spelled 'au'; and the generated phonetic notation includes [ɐ̯] for vocalized /ʁ/ (i.e. written 'r' in a syllable coda). However, you rarely if ever need to type these symbols explicitly.

Notes:

1. Doubled consonants, as well as digraphs/trigraphs etc. like 'ch', 'ng', 'tz', 'sch', 'tsch', etc. cause a  preceding vowel to be short and open (/ɛ ɪ ɔ ʊ œ ʏ/) unless lengthened with h or ː. 2. With occasional exceptions, a vowel before a single consonant (including at the end of a word or component) is closed (/e i o u ø y/), and long if stressed (primary or secondary). 3. The vowel 'e' is rendered by default as schwa (/ə/) in the following circumstances: a. The prefixes 'ge-', 'be-' are recognized specially and rendered with a schwa and without stress. This doesn't     apply if: i. The 'e' is respelled with an accent, e.g. 'Génitiv' for Genitiv "genitive" or 'géstern' for gestern "yesterday". ii. A + is placed at the putative prefix boundary. Hence Geograf "geographer" could be respelled 'Ge+ográf' to prevent 'Ge' from being interpreted as a prefix. iii. The cluster following the 'ge-' or 'be-' cannot be the start of a German word, e.g. bellen "to bark" ('ll' cannot start a word) or bengalisch "Bengal" ('ng' cannot start a word). a. If there is a following stress in the word, only in an internal (non-initial) open unstressed syllable (i.e.     followed by only a single consonant) when 'r' follows, as in Temp̱e̱ratur, Gene̱ralität, Souve̱ränität, Hete̱rogenität, degene̱rieren, where the underlined vowels are by default rendered as a schwa. Other cases with schwa like Sozietät or Pietät need to be respelled with a schwa, e.g. 'Soziǝtät', 'Pìǝtät'. b. If there is no following stress, any 'e' word-finally or followed by consonants that might form part of an     inflectional ending is rendered as a schwa, e.g. Lage̱, zuminde̱ste̱ns, verschiede̱ne̱n, verwende̱t. Examples where this does not happen are Late̱x, Ahme̱d, Bize̱ps, Borre̱tsch, Brege̱nz, which would be rendered by default with /ɛ/ or /e/. Cases like Achilles and Agens that have /ɛ/ but end in what looks like an inflectional ending should be respelled with ĕ. 4. Obstruents 'b' 'd' 'g' 'v' 'ʒ' are normally rendered as voiceless /p t k f ʃ/ at the end of a word or syllable. To cause them to be voiced, one way is to move the syllable boundary before the consonant by inserting a. before the consonant. If that isn't appropriate, add a * after the sound, as in 'b*', 'd*', 'z*', etc. 5. 'v' is normally rendered as underlying /v/ (which becomes /f/ at the end of a word or syllable). Words like vier, viel, Vater, etc. need respelling using 'f'. Note that prefixes ver-, vor-, voraus-, vorher-, etc. are recognized specially. 6. French-derived words often need respelling and may have sounds in them that don't have standard spellings in  German. For example, Orange can be spelled 'Orã́ʒe' or 'OráNʒe'; use a tilde or a following capital N to  indicate a nasal vowel, and use a 'ʒ' character to render the sound /ʒ/. 7. Rendering of 'ch' using ich-laut /ç/ or ach-laut /x/ is automatic. To force one or the other, use 'ç' explicitly (as in 'Açilles-ferse' for one pronunciation of Achillesferse "Achilles heel") or 'x' explicitly (as in  'X*uzpe' for Chuzpe "chutzpah"). 8. Vowels 'i' and 'u' in hiatus (i.e. directly before another vowel) are normally converted into glides, i.e. /i̯ u̯/, as in effizient (no respelling needed as '-ent' is recognized as a stressed suffix), Antigua (respelled  'Antíguah'). To preven this, add a '.' between the vowels to force a syllable boundary, as in aktualisieren 'àktu.alisieren'. An exception to the glide conversion is the digraph 'ie'; to force glide conversion, as in  Familie, use 'i̯' explicitly (respelled 'Famíli̯e'). Occasionally the ̯ symbol needs to be added to other vowels, e.g. in Ichthyologie respelled 'Ichthy̯ologie' (note that '-ie' is a recognized stressed suffix) and soigniert respelled 'so̯anjiert' ('-iert' is a recognized stressed suffix).

FIXME:

1. Implement which works like - but don't trigger secondary stress (< after a prefix, > before a suffix). 2. Implement <!, -! which work with < and - but suppress the vowel-initial glottal stop; <?, -? which work similarly but allow for optional glottal stop. 3. Implement prefix/suffix stripping; don't do it if explicit syllable boundary in cluster after prefix, or if no  vowel in main, or if impossible prefix/suffix onset/offset. 4. Automatically support -ge-, -zu- after stressed verbal prefixes. 5. Figure out how to support stacked suffixes like -barkeit, -samkeit, -lichkeit, -schaftlich, -licherweise. Perhaps there is a small enough set of common stacked suffixes to just list them all. 6. Finish entering prefixes and suffixes. 7. Add default stresses (primary stress on first syllable of first stressed component, etc.). 8. Handle inflectional suffixes: adjectival -e, -en, -em, -er, -es, also after adjectival -er, -st; nominal -(e)s, -(e)n; verbal -e, -(e)st, -(e)t, -(e)n, -(e)te, -(e)tst, -(e)tet, -(e)ten. 9. Ignore final period/question mark/exclamation point. 10. Implement nasal vowels. 11. Implement underscore to prevent assimilation/interpretation as a multigraph. 12. Implement [b] [d] [g] [z] [v] [s] [x]. 13. Implement dot-under to prevent stress. 14. Implement dot-over on a vowel to prevent a given autogenerated stress mark from being displayed. 15. Check allowed onsets with prefixes. 16. Implement allowed offsets and check with suffixes. 17. Implement 'style' ala Spanish pronun to handle standard vs. northern/Eastern vs. southern: e.g. berufsunfähig: /bəˈʁuːfs(ʔ)ʊnˌfɛːɪç/ /-ʔʊnˌfeːɪç/ /-ʔʊnˌfɛːɪk/ e.g. aufrichtig: /ˈaʊf.ʁɪç.tɪç/ /ˈaʊf.ʁɪç.tɪk/ e.g. Universität: /ˌuni.vɛʁ.ziˈtɛːt/ /ˌuni.vɛʁ.ziˈteːt/ 18. Implement splitting prefix/suffix pronun for -sam, with two pronunciations. 19. n before g/k in the same syllable should be ŋ. Sometimes also across syllables, cf. Ingrid. 20. Written 'ts' in the same syllable should be rendered with a tie, e.g. aufwärts, Aufenhaltsgenehmigung. Botsuana is tricky as it normally would have syllable division 't.s', but maybe we should special-case it so we get /bɔˈt͡su̯aːna/. Other examples: enträtseln, Fietse, Lotse, Mitsubishi, Rätsel, Hatsa, Tsatsiki, Whatsapp. In Outsider and Outsourcing, the 't' and 's' are pronounced separately, respelled 'Aut-s*aider' and 'Aut-s*ŏhßing' (or similar). 21. Implement handling of written 'y'. 22. Implement double hyphen and conversion of secondary to tertiary accents. 23. Handle unstressed words like und and von correctly. 24. Implement optional glottal stops when the following syllable isn't stressed (but only after a consonant; cf.   wiederentdecken ˈviːdɐʔɛntˌdɛkən). 25. Use x* etc. to get EXPLICIT_X etc. instead of [x], so that [...] can be used after a word to indicate word properties. 26. Handle prefixes/suffixes/interfixes indicated with initial and/or final hyphen. 27. t-s- should render as t͡s e.g. Gleichheitszeichen respelled 'Gleichheit-s-zeichen' and Aufenthaltstitel respelled 'Aufenthalt-s-titel'. 28. Need syllable dividers at component boundaries before unstressed syllables. 29. Open syllables with secondary stress get close but not lengthened vowel if there is a following primary stress within the component bounadry; cf. iterativ 'ìtêratív' [ˌiteʁaˈtiːf], Lethargie 'Lèthargie' [ˌletaʁˈɡiː]. Across a component bounadry this doesn't apply, so that e.g. hyperaktiv 'hỳper-áktihv' would be [ˌhyːpɐˈʔaktiːf]; if we need short /y/, either make it unstressed, e.g. 'hyper<aktihv' or use *, e.g. 'hỳ*per-áktiv'. 30. Use [vf] after a word for "verb form" to get verbal endings -st, -t, -tet etc. recognized. Expand this to handle all inflectional endings. 31. Support non-initial capital O for o̯, non-initial N for nasal vowel including after lengthening h, digraphs eI and oU for /ɛɪ̯/ and /ɔʊ̯/. 32. Nasal vowels should be long when stressed, and use the phonemes ã ɛ̃ õ œ̃ per Wikipedia. 33. Remove primary stress from a single-syllable word. 34. -ik- in the middle of a word should have short 'i', e.g. Musikerin; not sure if also applies when stress follows. Cf. Abdikation /ˌapdikaˈt͡si̯oːn/, Partikel /paʁˈtɪkəl/ or also /paʁˈtiːkəl/, Affrikate /ˌafʁiˈkaːtə/, Afrika /ˈaːfʁika/ or /ˈafʁika/, afrikaans /ˌafʁiˈkaːns/, Agnostiker /aˈɡnɔstɪkɐ/, Agrikultur /ˌaɡʁikʊlˈtuːʁ/, Akademikerin /akaˈdeːmɪkəʁɪn/, Silikat /ziliˈkaːt/, Amerika /aˈmeːʁika/, amikal /amiˈkaːl/, Anabolikum /anaˈboːlikʊm/, Syndikalismus /zʏndikaˈlɪsmʊs/, Angelika /aŋˈɡeːlika/, Anglikaner /aŋɡliˈkaːnɐ/, Antibiotikum /antiˈbi̯oːtikʊm/, Antipyretikum /antipyˈʁeːtikʊm/, apikal /apiˈkaːl/, appendikuliert /apɛndikuˈliːʁt/, Applikation /aplikaˈt͡si̯oːn/, Aprikose /ˌapʁiˈkoːzə/, Olympionikin /olʏmpi̯oˈniːkɪn/, Tsatsiki /t͡saˈt͡siːki/, Artikel /ˌaʁˈtiːkəl/ or /ˌaʁˈtɪkəl/, Batiken /ˈbaːtɪkən/. Seems to apply only to unstressed '-ik-' followed by 'e' + no stress. 35. 'h' between vowels should be lengthening only if no stress follows and the following vowel isn't a or o, and if   followed by i or u, that vowel should not be word-final. (DONE) 36. Re-parse prefix/suffix respellings for <, e.g. auseinander-. 37. Reimplement prefix-type restrictions using a finite state machine and handle secondary stress appropriately. 38. ks should divide .ks but shorten preceding vowels. (DONE) 39. Implement component_like_suffixes. 40. Make sure kohlenhydratreich respelled 'kohlen-hydrȁt>reich' works. 41. Redo multicomponent handling according to recent changes whereby acute and grave indicate relative stress rather than absolute stress; absolute stress can be specified using ˈ and ˌ or ' (single quote) and, (comma). Component stress is specified using * (primary stress) or ** (secondary stress) at the beginning of the component. 42. Prefix standing alone as a word or component should be recognized and respelled appropriately; cf. words like vorüber, auseinander, ab, um. ]=]

local export = {}

local force_cat = false -- for testing

local strutils = require("Module:string utilities") local m_table = require("Module:table") local m_IPA = require("Module:IPA") local lang = require("Module:languages").getByCode("de")

- Aliases for Unicode-safe versions of string functions -

local u = mw.ustring.char local rsubn = mw.ustring.gsub local rfind = mw.ustring.find local rmatch = mw.ustring.match local rsplit = mw.text.split local rgsplit = mw.text.gsplit local ulen = mw.ustring.len local ulower = mw.ustring.lower

- Unicode combining characters -

local ACUTE = u(0x0301) -- COMBINING ACUTE ACCENT = ́ local GRAVE = u(0x0300) -- COMBINING GRAVE ACCENT = ̀ local CFLEX = u(0x0302) -- COMBINING CIRCUMFLEX ACCENT = ̂ local TILDE = u(0x0303) -- COMBINING TILDE = ̃ local MACRON = u(0x0304) -- COMBINING MACRON = ̄ local BREVE = u(0x0306) -- COMBINING BREVE = ̆ local DIA = u(0x0308) -- COMBINING DIAERESIS = ̈ local DOTOVER = u(0x0307) -- COMBINING DOT ABOVE = ̇ local DOUBLEGRAVE = u(0x030F) -- COMBINING DOUBLE GRAVE ACCENT = ̏ local UNRELEASED = u(0x031A) -- COMBINING LEFT ANGLE ABOVE = ̚ local DOTUNDER = u(0x0323) -- COMBINING DOT BELOW = ̣ local UNVOICED = u(0x0325) -- COMBINING RING BELOW = ̥ local CEDILLA = u(0x0327) -- COMBINING CEDILLA = ̧ local SYLLABIC = u(0x0329) -- COMBINING VERTICAL LINE BELOW = ̩ local INVBREVEBELOW = u(0x032F) -- COMBINING INVERTED BREVE BELOW = ̯ local LINEUNDER = u(0x0331) -- COMBINING MACRON BELOW = ̱ local TIE = u(0x0361) -- COMBINING DOUBLE INVERTED BREVE = ͡

- Pattern-substitution function wrappers -

-- version of rsubn that discards all but the first return value local function rsub(term, foo, bar, n)	local retval = rsubn(term, foo, bar, n)	return retval end

-- version of rsubn that returns a 2nd argument boolean indicating whether a substitution was made local function rsubb(term, foo, bar, n)	local retval, nsubs = rsubn(term, foo, bar, n)	return retval, nsubs > 0 end

-- like str:gsub but discards all but the first return value local function gsub(term, foo, bar, n)	local retval = term:gsub(foo, bar, n)	return retval end

-- apply rsub repeatedly until no change local function rsub_repeatedly(term, foo, bar, n)	while true do		local new_term = rsub(term, foo, bar, n)		if new_term == term then return term end term = new_term end end

- Internally-used symbols -

-- We use the following internally: -- ⁀ represents a component boundary. Two components within an orthographic word are separated by ⁀; the edge of an --  orthographic word is denoted by ⁀⁀. The division between prefix and component, or between prefix and prefix, is --  also denoted by ⁀. -- ‿ represents the division between component and suffix, or suffix and suffix. We need to distinguish this from ⁀ --  for certain purposes, e.g. 'st' at the beginning of a component is /ʃt/ but 'st' at the beginning of a suffix is --   /st/. -- . represents a syllable boundary. -- AUTOACUTE, AUTOGRAVE, ORIG_SUFFIX_GRAVE: Described below. -- EXPLICIT_*: Described below.

-- Used to temporarily substitute capital I and U when lowercasing in canonicalize_and_split_words. It is OK if these -- overlap with other substitution symbols because their use is before any of the other symbols are used. local TEMP_I = u(0xFFF0) local TEMP_U = u(0xFFF1)

-- When auto-generating primary and secondary stress accents, we use these special characters, and later convert to -- normal IPA accent marks, so we can distinguish auto-generated stress from user-specified stress. local AUTOACUTE = u(0xFFF0) local AUTOGRAVE = u(0xFFF1) -- An auto-generated secondary stress in a suffix is converted to the following if the word is not composed of -- multiple (hyphen or double-hyphen separated) components. It is eventually converted to secondary stress (if there is -- no preceding secondary stress, and the directly preceding syllable does not have primary stress), and otherwise -- removed. local ORIG_SUFFIX_GRAVE = u(0xFFF2) local SYLDIV = u(0xFFF3)

- Explicit allophone notation -

-- When the user uses the "explicit allophone" notation such as [z] or [x] to force a particular allophone, we -- internally convert that notation into a single special character. local EXPLICIT_S = u(0xFFF4) local EXPLICIT_Z = u(0xFFF5) local EXPLICIT_V = u(0xFFF6) local EXPLICIT_B = u(0xFFF7) local EXPLICIT_D = u(0xFFF8) local EXPLICIT_G = u(0xFFF9) local EXPLICIT_X = u(0xFFFA)

-- Map "explicit allophone" notation into special char. See above. local char_to_explicit_char = { ["s"] = EXPLICIT_S, ["z"] = EXPLICIT_Z, ["v"] = EXPLICIT_V, ["b"] = EXPLICIT_B, ["d"] = EXPLICIT_D, ["g"] = EXPLICIT_G, ["x"] = EXPLICIT_X, }

-- Map "explicit allophone" notation into normal spelling, for supporting ann=. local char_to_spelling = { ["s"] = "s", ["z"] = "s", ["v"] = "v", ["b"] = "b", ["d"] = "d", ["g"] = "g", ["x"] = "ch", }

-- Map "explicit allophone" notation into phonemes. local explicit_char_to_phonemic = { [EXPLICIT_S] = "s", [EXPLICIT_Z] = "z", [EXPLICIT_V] = "v", [EXPLICIT_B] = "b", [EXPLICIT_D] = "d", [EXPLICIT_G] = "ɡ", -- IPA ɡ! [EXPLICIT_X] = "x", }

- Character classes to simplify use of patterns -

local stress = ACUTE .. GRAVE .. DOUBLEGRAVE .. AUTOACUTE .. AUTOGRAVE .. ORIG_SUFFIX_GRAVE .. "ˈˌ" local stress_c = "[" .. stress .. "]" local non_stress_c = "[^" .. stress .. "]" local acute = ACUTE .. AUTOACUTE local acute_c = "[" .. acute .. "]" local grave = GRAVE .. AUTOGRAVE local grave_c = "[" .. grave .. "]" local accent_non_stress_non_invbrevebelow = BREVE .. CFLEX .. TILDE .. DOTUNDER .. MACRON .. "ː*" local accent_non_stress = accent_non_stress_non_invbrevebelow .. INVBREVEBELOW local accent_non_stress_c = "[" .. accent_non_stress .. "]" local accent = stress .. accent_non_stress local accent_c = "[" .. accent .. "]" local accent_non_invbrevebelow = stress .. accent_non_stress_non_invbrevebelow local accent_non_invbrevebelow_c = "[" .. accent_non_invbrevebelow .. "]" local non_accent_c = "[^" .. accent .. "]" -- Use both IPA symbols and letters so it can be used at any step of the process. local back_vowel_non_glide = "aɑoɔuʊ" local back_vowel = back_vowel_non_glide .. "U" local back_vowel_c = "[" .. back_vowel .. "]" local front_vowel_non_glide = "eɛiɪyʏøœäöü" local front_vowel = front_vowel_non_glide .. "I" local schwalike = "əɐ" local vowel = back_vowel .. front_vowel .. schwalike local V = "[" .. vowel .. "]" local non_V = "[^" .. vowel .. "]" local vowel_non_glide = back_vowel_non_glide .. front_vowel_non_glide .. schwalike local V_non_glide = "[" .. vowel_non_glide .. "]" local vowel_unmarked_for_quality = "aeiouäöü" local V_unmarked_for_quality = "[" .. vowel_unmarked_for_quality .. "]" local charsep = accent .. "." .. SYLDIV .. "‿" local charsep_c = "[" .. charsep .. "]" local wordsep = charsep .. " ⁀" local wordsep_c = "[" .. wordsep .. "]" local cons_guts = "^" .. vowel .. wordsep .. "_" -- guts of consonant class local C = "[" .. cons_guts .. "]" -- consonant local C_not_lr = "[" .. cons_guts .. "lr]" -- consonant not 'l' or 'r' local C_not_h = "[" .. cons_guts .. "h]" -- consonant not 'h' local C_not_s = "[" .. cons_guts .. "s]" -- consonant not 's' -- Include both regular g and IPA ɡ so it can be used anywhere local obstruent_non_sibilant = "pbfvkgɡtdxç" .. EXPLICIT_B .. EXPLICIT_D .. EXPLICIT_G .. EXPLICIT_V .. EXPLICIT_X local obstruent_non_sibilant_c = "[" .. obstruent_non_sibilant .. "]" local unvoiced_cons = "ptkfsßʃxç" .. EXPLICIT_S .. EXPLICIT_X local unvoiced_C = "[" .. unvoiced_cons .. "]"

- Allowed onsets for prefix-splitting -

-- This is a list of patterns that collectively describe all possible orthographic word onsets. This is used when -- splitting off prefixes like ge- and be- to prevent splitting in words like Gertrud ('rtr-' is not a -- possible word onset). local allowed_onsets = { "",	"[bcdfghjklmnpqrstvwxz]", -- Many single consonants can be followed by j but they are all foreign terms that can't be prefixed. "ch[lr]?", "sch[lmnrw]?", "[td]sch", "[kg]n", "[kgbpfc]l", "[kgbpdtfc]r", "[kgbpdt]h", "ph[lr]", "thr", "s[ckpt]r?", "s[pk]l", "s[pt]h", "sz", "vl", "w[lr]", }

- Possible state transitions when handling prefix combinations -

-- We view the possible combinations of prefixes through the lens of a finite state machine. This allows us to -- handle occasional cases where more than one of a given type of prefix occurs (e.g. überzubeanspruchen). -- NOTE ABOUT STRESS: After un-, stressed prefixes lose their stress and the root takes secondary stress; cf. -- 'únausgegòren'. But otherwise if there are two stressed prefixes, the second one takes secondary stress and -- the root loses the stress; cf. 'ǘberbeànspruchen', 'ǘberzubeànspruchen'. local prefix_previous_allowed_states = { -- un- can occur after unstressed prefixes; cf. verunglücken, Verunreinigung, verunstalten, -- beunruhigen ["un"] = m_table.listToSet { "unstressed" }, -- stressed prefixes can occur after un- (unausgegoren) or after unstressed prefixes (beauftragen,	-- beabsichtigen, veranlagen) ["stressed"] = m_table.listToSet { "un", "unstressed" }, -- unstressed prefixes can occur after un- (unzerstörbar), after stressed prefixes (aufbewahren,	-- Aufenthalt), or after unstressed -zu- (aufzubewahren) ["unstressed"] = m_table.listToSet { "un", "stressed", "unstressed-zu" }, -- unstressed -zu- can occur after stressed prefixes only (anzufangen) ["unstressed-zu"] = m_table.listToSet { "stressed" }, }

- Recognized prefixes -

--[=[ The following contains prefixes to be recognized specially, segmented off automatically and respelled appropriately. The format of each entry is {PREFIX, RESPELLING, PROPS...} where PREFIX is the actual spelling of the prefix (without any stress marks), RESPELLING is the appropriate phonemic respelling, spelled as if the prefix were a word by itself (with appropriate acute and grave accents marking primary and secondary stress). PROPS are optional named properties; see below. Order of the entries is important as they are checked in order. In particular, B must follow A if B is a prefix of A. For example, "her" must follow "heraus", "herbei", "herüber", etc. and "un" must follow "unter".

The following named properties are currently recognized: the part of the word to the right of the prefix, or a list of such regexes, where one of them must match in the same fashion. An example is "emp", which has a restriction so that it is only recognized in the form "empf-", i.e. where the rest of the word begins with an f.  The types "stressed" and "unstressed" are autodetected and don't normally need to be given, but the other two must be given using 'prefixtype' whenever they are applicable. in its original form, not respelled form -- with the appropriate vowel given a grave accent. This can be omitted if the secondary stress goes on the first vowel. This is used so that the user can put secondary stress on a prefix (whether or not explicitly sectioned off using '<') and still have it recognized and respelled appropriately. An example is unkalkulierbar, respelled 'ùnkalkulierbar', which generates /ˌʊnkalkuˈliːʁbaːʁ/. Note for example that the 'n' shows up as /n/ not /ŋ/ before 'k', and 'u' shows up as /ʊ/ not /u/, which indicates that 'ùn-' is correctly sectioned off as a prefix and given a respelling of 'ùnn-'. word. This is used with prefixes such as 'ab-', 'an-', 'her-', 'mit-', 'ur-' that frequently form non-prefix parts of foreign-origin words (e.g. abundant, Animation, hereditär, mitigieren, Urbanisierung). This is a heuristic as most foreign-origin words have non-initial stress; but it produces some false positives e.g. Abenteuer, Annika, Herkules, uruguayisch (which need respelling like 'Ab+enteuer' with '+' to indicate no prefix boundary or alternatively 'Ábenteuer' with explicit stress) and occasional false negatives e.g. abalienieren, anhand (which need respelling using '<', like 'an<hand' or 'àb<ali̯enieren'; the latter also puts secondary stress on the prefix, which will be propagated onto the respelling 'àbb', and uses 'i̯' to get  a glide rather than 'ie' being interpreted as a long vowel).
 * 'restriction': Place additional restrictions on when the prefix can occur. Value is either a regex that must match
 * 'prefixtype': Override the autodetected type of prefix, one of "stressed", "unstressed", "unstressed-zu" or "un".
 * 'secstress': If the prefix is stressed, i.e. its respelling has an acute accent, this should specify the prefix --
 * 'not_with_following_primary_stress': Do not recognize this prefix if there is a primary stress later on in the

Some prefixes can be both stressed and unstressed, e.g. durch-, unter-, über-, wieder-. For some, e.g. miss- and wider-, there are systematic alternations in stress: unstressed when functioning as a verbal prefix followed by an initial-stressed verb, stressed otherwise. This is too complex and unpredictable for us to handle, so we treat all these prefixes as stressed. Respell using < when unstressed, e.g. 'umfahren' "to knock down with a vehicle", 'um<fahren' "to drive around, to bypass". ]=]

local prefixes = { -- To reduce false positives, we don't recognize when main part or suffix has primary stress, e.g. abandonnieren -- /abandɔnˈiːʁən/, Abasie /abaˈziː/, Abbreviation /abʁevi̯aˈt͡si̯oːn/, Abbreviatur /abʁevi̯aˈtuːʁ/, -- Abchasisch /apˈxaːzɪʃ/, abdominal /apdomiˈnaːl/, Abduktor /apˈdʊktoːʁ/, abessinisch /abɛˈsiːnɪʃ/, -- Abitur /ˌabiˈtuːʁ/, Abonnement /abɔnəˈmɑ̃ː/, abonnieren /abɔˈniːʁən/, Abort (one meaning) -- /aˈbɔʁt/, Abrasion /apʁaˈzi̯oːn/ or /abʁaˈzi̯oːn/, abrasiv /abʁaˈziːf/, Absence /apˈsãːs/, Absinth -- /apˈzɪnt/, absolut /apzoˈluːt/, Absolvent /apz̥ɔlˈvɛnt/, absorbieren /apzɔʁˈbiːʁən/, Absorption -- /apzɔʁpˈt͡si̯oːn/, abstinent /apstiˈnɛnt/, Abstinenz /apstiˈnɛnt͡s/, abstrahieren /apstʁaˈhiːʁən/, -- Abstraktum /apˈstʀaktʊm/, abstrus /apˈstʁuːs/, absurd /apˈzʊʁt/, Absurdität /ˌapzʊʁdiˈtɛːt/, -- Abtei /apˈtaɪ̯/, Abu Dhabi ˈabu ˈdaːbi, Abulie /abuˈliː/, abundant /abʊnˈdant/. --	-- We don't restrict to not preceding vowels because of words like abändern, abarbeiten, Abart, -- abeisen, aberkennen, abirren, abisolieren, Abordnung, Abort (one meaning). --	-- We still have a few false positives needing '+', e.g. Abakus /ˈabakʊs/, Abend /ˈaːbənt/, Abenteuer -- /ˈaːbəntɔɪ̯ɐ/, (aber /ˈaːbɐ/; main part too short so won't be segmented in any case), Ablativ -- /ˈaplaˌtiːf/, Abraham /ˈaːbʁaˌha(ː)m/. --	-- We have a few false negatives needing '<' or '<<', e.g. Abalienation /ˌapˌʔali̯enaˈt͡si̯oːn/, abalienieren -- /ˌapʔali̯eˈniːʁən/, abaxial /apʔaˈksi̯aːl/, abhanden /apˈhandən/, abscheulich /apˈʃɔɪ̯lɪç/. {"ab", "ább", not_with_following_primary_stress = true}, {"aneinander", "ànn<einánder", secstress = "aneinànder"}, {"anheim", "ann<héim", secstress = "anhèim"}, -- Must follow aneinander- and anheim-. --	-- To reduce false positives, we don't recognize when main part or suffix has primary stress, e.g. anabol -- /anaˈboːl/, Anabolikum /anaˈboːlikʊm/, anal /aˈnaːl/, analog /anaˈloːk/, Analogie /analoˈɡiː/, -- Analyse /anaˈlyːzə/, analysieren /ˌanalyːˈziːʁən/, Analysis /aˈnaːlyzɪs/, analytisch -- /ˌanaˈlyːtɪʃ/, anamorph /anaˈmɔʁf/, Anapäst /anaˈpɛːst/, Anaptyxe /anapˈtʏksə/, anarchisch -- /aˈnaʁxiʃ/, Anathema /aˈnaːtema/, Anatomie /anatoˈmiː/, anatomisch /anaˈtoːmɪʃ/, Anchovis -- /anˈʃoːvɪs/, Andalusien /ˌandaˈluːzi̯ən/, Andorraner /andɔˈʁaːnɐ/, Andrea /anˈdʁeːa/, Androgen -- /andʁoˈɡeːn/, androgyn /andʁoˈɡyːn/, Android /andʁoˈiːt/, Andrologie /andʁoloˈɡiː/, Anekdote -- /anɛkˈdoːtə/, Angela /aŋˈɡeːla/, Angeliter /aŋɡeˈliːtɐ/, Anglikaner /aŋɡliˈkaːnɐ/, Anglistik -- /aŋˈɡlɪstɪk/, Angola /aŋˈɡoːlaː/, angolanisch /aŋɡoˈlaːnɪʃ/, Anilingus /aniˈlɪŋɡʊs/, Animation -- /animaˈt͡si̯oːn/, animieren /aniˈmiːʁən/, Animosität /ˌanimoziˈtɛːt/, Anis /aˈniːs/, Annalen -- /aˈnaːlən/, annektieren /anɛkˈtiːʁən/, annullieren /anʊˈliːʁən/, anonym /ˌanoˈnyːm/, Anonymität -- /anonymiˈtɛːt/, Antagonismus /antaɡoˈnɪsmʊs/, Antenne /anˈtɛnə/, anterior /anˈteːʁioːʁ/, -- Anthrazit /ˌantʁaˈt͡siːt/ or /ˌantʁaˈt͡sɪt/, Anthropologie /antʁopoloˈɡiː/, anthropomorph -- /antʁopoˈmɔʁf/, Anthropomorphismus /ˌantʁopomɔʁˈfɪsmʊs/, Anthroposophie /antʁopozoˈfiː/, Antonym -- /antoˈnyːm/, Antonymie /antonyˈmiː/, Antwerpen /antˈvɛʁpən/. --	-- We don't restrict to not preceding vowels because of words like anecken /ˈanˌʔɛkən/, Aneignung -- /ˈanˌʔaɪ̯ɡnʊŋ/, anekeln /ˈanˌʔeːkəln/, Anerbieten /ˈan(ʔ)ɛʁˌbiːtən/, anerkennen /ˈan(ʔ)ɛʁˌkɛnən/, -- anöden /ˈanˌʔøːdən/, anordnen /ˈanˌʔɔʁdnən/. --	-- We still have some false positives needing '+', e.g. Ananas /ˈananas/, (Anden /ˈandən/; main part too	-- short so won't be segmented in any case), anderens /ˈandərəns/, anderer /ˈandərər/, anderleuts -- /ˈandɐˌlɔɪ̯t͡s/, Aneis /ˈanaɪ̯s/, (Angel /ˈaŋəl/; main part too short so won't be segmented in any case), -- Angela /ˈaŋɡela/ or /ˈaŋəla/, angeln /ˈaŋəln/, (Anger /ˈaŋər/; main part too short so won't be	-- segmented in any case), (Angler /ˈaŋlɐ/; main part too short so won't be segmented in any case), Anika -- /ˈaniˌka/, Anime /ˈanime/, (Anis /ˈaːnɪs/ or /ˈanɪs/; main part too short so won't be segmented in any	-- case), ankern /ˈaŋkɐn/, (Anna /ˈʔana/; main part too short so won't be segmented in any case), Annam -- /ˈanam/, Annika /ˈaniˌka/, Anorak /ˈanoʁak/, Anton /ˈantoːn/ (Anus /ˈʔaːnʊs/; main part too	-- short so won't be segmented in any case). --	-- We have a few false negatives needing '<' or '<<': aneinander /anʔaɪ̯ˈnandɐ/, anhand /anˈhant/. --	-- We add a restriction to not segment in anti-. {"an", "ánn", not_with_following_primary_stress = true, restriction = {"^[^t]", "^t[^i]"}}, {"aufeinander", "auf<einánder", secstress = "aufeinànder"}, -- Must follow aufeinander-. {"auf", "áuf"}, {"auseinander", "aus<einánder", secstress = "auseinànder"}, -- Must follow auseinander-. {"aus", "áus"}, {"außer", "áußer"}, {"beieinander", "bei<einánder", secstress = "beieinànder"}, -- FIXME, secondary stress in beiseite should get demoted to tertiary when handling explicit secondary stress -- from user. {"beiseite", "beiséite", secstress = "beisèite"}, -- Must follow beieinander-. {"bei", "béi"}, -- Allow be- before -u- only in beur-, beun-; cf. beurlauben, Beunruhigung. Must follow bei-. {"be", "bə", restriction = {"^[^ui]", "^u[rn]"}}, {"dafür", "dafǘr", secstress = "dafǜr"}, {"dagegen", "dagégen", secstress = "dagègen"}, {"daher", "dahér", secstress = "dahèr"}, {"dahinter", "dahínter", secstress = "dahìnter"}, -- Must follow dahinter-. {"dahin", "dahínn", secstress = "dahìn"}, {"daneben", "danében", secstress = "danèben"}, -- To reduce false positives, we don't recognize when main part or suffix has primary stress, e.g. darauf -- /daˈʁaʊ̯f/, and additionally include a restriction to not segment when a vowel follows, e.g. the alternative -- pronunciation of darauf as /ˈdaːʁaʊ̯f/ (without this restriction we'd get #/ˈdaːʁˌʔaʊ̯f/). The -- `not_with_following_primary_stress` condition is mostly redundant given the non-vowel restriction. {"dar", "dár", not_with_following_primary_stress = true, restriction = "^" .. C}, {"davon", "dafónn", secstress = "davòn"}, {"davor", "dafór", secstress = "davòr"}, {"dazu", "dazú", secstress = "dazù"}, {"durcheinander", "durch<einánder", secstress = "durcheinànder"}, -- Must follow durcheinander-. {"durch", "dúrch"}, {"ein", "éin"}, {"empor", "empór", secstress = "empòr"}, -- Must follow empor-. {"emp", "emp", restriction = "^f"}, {"entgegen", "entgégen", secstress = "entgègen"}, {"entlang", "ent.láng", secstress = "entlàng"}, {"entzwei", "ent.zwéi", secstress = "entzwèi"}, -- Must follow entgegen- and entlang-. {"ent", "ent"}, {"er", "err"}, {"fort", "fórt"}, {"gegenüber", "gehgen<ǘber", secstress = "gegenǜber"}, -- Most words in 'gei-' aren't past participles, cf. Geier, Geifer, geifern, Geige, geigen, -- Geiger, geil, geilo, Geisel, Geiser, Geisha, Geiß, Geißel, geißeln, -- Geist, Geister, geistig, Geiz, geizen. There are only a few, e.g. geimpft, which need -- respelling, e.g. 'ge<impft'. No restriction on 'geu-' because only one non-past-participle observed: -- Geusenwort (which needs respelling like 'Géusen-wort'), and there are various past participles in 'geu-', -- especially 'geur-'. {"ge", "gə", restriction = "^[^i]"}, {"herab", "herrább", secstress = "heràb"}, {"heran", "herránn", secstress = "heràn"}, {"herauf", "herráuf", secstress = "heràuf"}, {"heraus", "herráus", secstress = "heràus"}, {"herbei", "herbéi", secstress = "herbèi"}, {"herein", "herréin", secstress = "herèin"}, {"hernieder", "herníeder", secstress = "hernìeder"}, {"herüber", "herrǘber", secstress = "herǜber"}, {"herum", "herrúmm", secstress = "herùm"}, {"herunter", "herrúnter", secstress = "herùnter"}, {"hervor", "herfór", secstress = "hervòr"}, -- Must follow herab-, heran-, etc.	-- -- To reduce false positives, we don't recognize when main part or suffix has primary stress, e.g. Heraldik -- /heˈʁaldɪk/, Herbarium /hɛʁˈbaːʁi̯ʊm/, hereditär /heʁediˈtɛːʁ/, herein /hɛˈʁain/, Hermaphrodit -- /ˌhɛʁ.ma.fʁoˈdiːt/ or /ˌhɛʁm.ʔa.fʁoˈdiːt/, Hermelin /hɛʁməˈliːn/, Hermeneutik /ˌhɛʁmeˈnɔɪ̯tɪk/, -- hermetisch /hɛʁˈmeːtɪʃ/, hermitesch /hɛʁˈmiːtɛʃ/, Heroin /heʁoˈiːn/, heroisch /heˈʁoːɪʃ/, -- Herold /ˈheːrɔlt/, Herzegowina /ˌhɛʁt͡seˈɡoːvina/ or /ˌhɛʁt͡seɡoˈviːna/. --	-- We still have some false positives needing '+', e.g. Herberge /ˈhɛʁˌbɛʁɡə/, Hering /ˈheːʁɪŋ/, -- Herkules /ˈhɛʁkuˌlɛs/, Herling /ˈheːʁ.lɪŋ/, Hermann /ˈhɛʁ.man/, (Heros /ˈheːʁɔs/; main part too	-- short so won't be segmented in any case), Herrin /ˈhɛʁɪn/, herrisch /ˈhɛʁɪʃ/, herzig /ˈhɛʁt͡sɪç/, -- Herzog /ˈhɛʁˌt͡soːk/. {"her", "hér", not_with_following_primary_stress = true}, {"hinab", "hinnább", secstress = "hinàb"}, {"hinan", "hinnánn", secstress = "hinàn"}, {"hinauf", "hinnáuf", secstress = "hinàuf"}, {"hinaus", "hinnáus", secstress = "hinàus"}, {"hindurch", "hindúrch", secstress = "hindùrch"}, {"hinein", "hinnéin", secstress = "hinèin"}, {"hintan", "hint<ánn", secstress = "hintàn"}, {"hinterher", "hinter<hér", secstress = "hinterhèr"}, {"hinter", "hínter"}, {"hinüber", "hinnǘber", secstress = "hinǜber"}, {"hinunter", "hinnúnter", secstress = "hinùnter"}, {"hinweg", "hinwéck", secstress = "hinwèg"}, -- Must follow hinab-, hinan-, etc.	{"hin", "hínn"}, -- too many false positives for in- {"miss", "míss"}, {"mit", "mítt", not_with_following_primary_stress = true}, {"nach", "nahch"}, {"nebeneinander", "nehben<einánder", secstress = "nebeneinànder"}, {"neben", "nében"}, {"nieder", "níeder"}, {"übereinander", "ühber<einánder", secstress = "übereinànder"}, -- Must follow übereinander-. {"über", "ǘber"}, -- Unstressed variant of über-. We include this for cases like zürucküberweisen. {"über", "ühber"}, -- umeinander- only dialectal (West Bavarian) {"um", "úmm"}, {"unter", "únter"}, -- Must follow unter-. Has its own prefixtype; cf. unvorhergesehen /ˈʊnfoːʁheːʁɡəˌzeːən/. {"un", "únn", prefixtype = "un"}, -- To reduce false positives, we don't recognize when main part or suffix has primary stress, e.g. Urämie -- /uʁɛˈmiː/, Uran /uˈʁaːn/, uranhaltig /uˈʁaːnˌhaltɪç/, urban /ʊʁˈbaːn/, urbanisieren -- /ʊʁbaniˈziːʁən/, urbanophil /ʊʁbanoˈfiːl/, Urethan /uʁeˈtaːn/, urgieren /ʊʁˈɡiːʁən/, Urin -- /uˈʁiːn/, urinal /uʁiˈnaːl/, urinieren /uʁiˈniːʁən/, uroborisch /uˈʁoːboʁɪʃ/, Urologe -- /ˌuʁoˈloːɡə/, Urologin /ˌuʁoˈloːɡɪn/, urologisch /ˌuːʁoˈloːɡɪʃ/. We don't restrict to not preceding -- vowels because of words like uramerikanisch, Uraufführung, Ureinwohner, Uropa. We still -- have a few false positives needing '+', e.g. Uranus /ˈuːʁanʊs/, uruguayisch /ˈuːʁuɡvaɪ̯ɪʃ/. --	-- Stress pattern is like un-; Urabstimmung /ˈuːʁʔapˌʃtɪmʊŋ/, Uraufführung /ˈuːʁʔaʊ̯fˌfyːʁʊŋ/, -- uraufgeführt /ˈuːʁʔaʊ̯fɡəˌfyːʁt/, Ureinwohner /ˈuːʁʔaɪ̯nˌvoːnɐ/; note Urzustand given in dewikt as -- /ˈuːʁˌt͡suːʃtant/ but audio sounds more like /ˈuːʁt͡suːˌʃtant/. {"ur", "úr", prefixtype = "un", not_with_following_primary_stress = true}, {"ver", "ferr"}, -- vorab-: only vorabeintscheiden {"voran", "foránn", secstress = "voràn"}, -- vorauf-: only voraufgehen {"voraus", "foráus", secstress = "voràus"}, {"vorbei", "fohrbéi", secstress = "vorbèi"}, -- respell per dewikt pronun {"vorher", "fohrhér", secstress = "vorhèr"}, -- respell per dewikt pronun {"vorüber", "forǘber", secstress = "vorǜber"}, -- Must follow voran-, voraus-, etc.	{"vor", "fór"}, {"weg", "wéck"}, {"weiter", "wéiter"}, {"wider", "wíder"}, {"wieder", "wíeder"}, {"zer", "zerr"}, {"zueinander", "zu<einánder", secstress = "zueinànder"}, {"zurecht", "zurécht", secstress = "zurècht"}, {"zurück", "zurǘck", secstress = "zurǜck"}, {"zusammen", "zusámmen", secstress = "zusàmmen"}, -- Listed twice, first as stressed then as unstressed, because of zu-infinitives like anzufangen. At the -- beginning of a word, stressed zú- will take precedence, but after another prefix, stressed prefixes can't occur, -- and unstressed -zu- will occur. Must follow zueinander-, zurecht-, etc.	{"zu", "zú"}, -- We use a separate type for unstressed -zu- because it can be followed by another unstressed prefix, e.g. -- auszubedingen, whereas normally two unstressed prefixes cannot occur. {"zu", "zu", prefixtype = "unstressed-zu"}, {"zwischen", "zwíschen"}, }

- Recognized "component-like" suffixes -

--[=[ The following contains suffixes that maintain their stress after a stressed syllable, as in handfest, reißfest, and "steal" the secondary stress, as in albtraumhaft /ˈalptʁaʊ̯mˌhaft/. The format of each entry is approximately the same as for prefixes above. The differences are: must follow "ierbar" and "haft" must follow "schaft". output. This is used e.g. for "-sam", which has two possible pronunciations, with the vowel long or short. (FIXME: We don't actually support this yet.) of 'n' = noun, 'v' = verb, 'a' = adjective, 'b' = adverb, or a list of such strings). This is currently underused and may be deleted.  example, "erweise" is restricted to follow a consonant. ]=]
 * Order matters, as for prefixes, but in this case the rule is that B must follow A if B is a suffix of A, e.g. "bar"
 * The respelling can be a list of possible pronunciations, in which case multiple pronunciations are produced on
 * There is an additional named property 'pos' that is intended to specify the resulting part of speech (a string, one
 * The named properties 'prefixtype' and 'not_with_following_primary_stress' do not exist.
 * The named property 'restriction' works as for prefixes except that it matches the part *before* the suffix. For

local component_like_suffixes = { -- Not necessary; we split off -erweise in a first pass, and then -lich will be recognized. -- {"licherweise", ">lich>er-weise", pos = "b"}, {"erweise", ">er--weise", restriction = C .. "$", pos = "b"}, -- Examples: bibelfest /ˈbiːbəlˌfɛst/ (would be same if regular stress), bissfest /ˈbɪsˌfɛst/, handfest -- /ˈhantˌfɛst/, kratzfest /ˈkʁat͡sˌfɛst/, krisenfest /ˈkʁiːzənˌfɛst/ (would be same if regular stress), -- reißfest /ˈʁaɪ̯sˌfɛst/, säurefest /ˈzɔɪ̯ʁəˌfɛst/ (would be same if regular stress), schossfest -- /ˈʃɔsˌfɛst/, wasserfest /ˈvasɐˌfɛst/ (would be same if regular stress), witterungsfest /ˈvɪtəʁʊŋsˌfɛst/ -- (would be same if regular stress) {"fest", "--fest", pos = "a"}, -- can follow a vowel as in säurefest -- Examples: akzentfrei /akˈt͡sɛntˌfʁaɪ̯/, alkoholfrei /alkoˈhoːlˌfʁaɪ̯/, apothekenfrei /apoˈteːkənˌfʁaɪ̯/ -- (would be same if regular stress), bleifrei /ˈblaɪ̯ˌfʁaɪ̯/, bündnisfrei /ˈbʏntnɪsˌfʁaɪ̯/ (would be same if	-- regular stress), einwandfrei /ˈaɪ̯nvantˌfʁaɪ̯/, erschütterungsfrei /ɛʁˈʃʏtəʁʊŋsˌfʁaɪ̯/ (would be same if	-- regular stress), gastfrei /ˈɡastˌfʁaɪ̯/, gemeinfrei /ɡəˈmaɪ̯nˌfʁaɪ̯/, glutenfrei /ɡluˈteːnˌfʁaɪ̯/, -- holzschlifffrei /ˈhɔlt͡sʃlɪfˌfʁaɪ̯/, keimfrei /ˈkaɪ̯mˌfʁaɪ̯/, kontextfrei /ˈkɔntɛkstˌfʁaɪ̯/ -- (would be same if regular stress), niederschlagsfrei /ˈniːdɐʃlaːksˌfʁaɪ̯/, nikotinfrei /nikoˈtiːnfʁaɪ̯/, -- rechtsfrei /ˈʁɛçt͡sˌfʁaɪ̯/, säurefrei /ˈzɔɪ̯ʁəˌfʁaɪ̯/ (would be same if regular stress), schadstofffrei -- /ˈʃaːtʃtɔfˌfʁaɪ̯/, schneefrei /ˈʃneːˌfʁaɪ̯/, schulfrei /ˈʃuːlˌfʁaɪ̯/, steuerfrei /ˈʃtɔɪ̯ɐˌfʁaɪ̯/, -- straffrei /ˈʃtʁaːfˌfʁaɪ̯/, stressfrei /ˈʃtʁɛsˌfʁaɪ̯/, unfallfrei /ˈʊnfalˌfʁaɪ̯/, versandkostenfrei -- /fɛʁˈzantkɔstənˌfʁaɪ̯/, vibrationsfrei /vibʁaˈt͡si̯oːnsˌfʁaɪ̯/, visafrei /ˈviːzaˌfʁaɪ̯/ (would be same if	-- regular stress) {"frei", "--frei", pos = "a"}, -- Examples: Arbeitslosigkeit /ˈaʁbaɪ̯t͡sˌloːzɪçkaɪ̯t/ (would be same if regular suffix), Arglosigkeit -- /ˈaʁkˌloːzɪçkaɪ̯t/, Ausnahmslosigkeit /ˈaʊ̯snaːmsloːzɪçkaɪ̯t/, Bedeutungslosigkeit /bəˈdɔɪ̯tʊŋsˌloːzɪçkaɪ̯t/ -- (would be same if regular suffix), Charakterlosigkeit /kaˈʁaktɐˌloːzɪçkaɪ̯t/ (would be same if regular	-- suffix), Gefühllosigkeit /ɡəˈfyːlˌloːzɪçkaɪ̯t/, Jugendarbeitslosigkeit /ˈjuːɡəntˌʔaʁbaɪ̯t͡sloːzɪçkaɪ̯t/ -- (FIXME: our rules produce ˈjuːɡəntʔaʁbaɪ̯t͡sˌloːzɪçkaɪ̯t), Obdachlosigkeit /ˈɔpdaxˌloːzɪçkaɪ̯t/ (FIXME: dewikt	-- and enwikt have no secondary stress), Pietätlosigkeit /ˌpiːəˈtɛːtlozɪçˌkaɪ̯t/ (FIXME: our rules produce	-- ˌpiːəˈtɛːtˌlozɪçkaɪ̯t), Reglosigkeit /ˈʁeːkˌloːzɪçkaɪ̯t/ (FIXME: dewikt and enwikt have no secondary stress), -- Ruchlosigkeit /ˈʁuːxˌloːzɪçkaɪ̯t/ or /ˈʁʊxˌloːzɪçkaɪ̯t/, Rücksichtslosigkeit /ˈʁʏkzɪçt͡sˌloːzɪçkaɪ̯t/, -- Schlaflosigkeit /ˈʃlaːfloːzɪçkaɪ̯t/ (FIXME: dewikt and enwikt have no secondary stress), -- Teilnahmslosigkeit /ˈtaɪ̯lnaːmsˌloːzɪçkaɪ̯t/, Willenlosigkeit /ˈvɪlənˌloːsɪçkaɪ̯t/ (would be same if	-- regular stress), Zügellosigkeit /ˈt͡syːɡəlˌloːzɪçkaɪ̯t/ (would be same if regular stress) {"losigkeit", "--losigkèit", pos = "n"}, -- Examples: alternativlos /ˌaltɐnaˈtiːfˌloːs/, anstandslos /ˈanʃtant͡sˌloːs/, arglos /ˈaʁkˌloːs/, -- atemlos /ˈaːtəmˌloːs/ (would be same if regular suffix), ausdruckslos /ˈaʊ̯sdʁʊksˌloːs/, ausweglos -- /ˈaʊ̯sveːkˌloːs/, bargeldlos /ˈbaːʁɡɛltˌloːs/, bedingungslos /bəˈdɪŋʊŋsˌloːs/ (would be same if regular	-- suffix), besitzlos /bəˈzɪt͡sˌloːs/ (FIXME: enwikt but not dewikt have secondary stress), charakterlos -- /kaˈʁaktɐˌloːs/ (would be same if regular suffix), einflusslos /ˈaɪ̯nflʊsˌloːs/, ersatzlos /ɛʁˈzat͡sˌloːs/, -- erwerbslos /ɛʁˈvɛʁpsˌloːs/, fraglos /ˈfʁaːkˌloːs/ (FIXME: dewikt and enwikt have no secondary stress), -- furchtlos /ˈfʊʁçtˌloːs/, gefühllos /ɡəˈfyːlˌloːs/, gesichtslos /ɡəˈzɪçt͡sˌloːs/, gottlos -- /ˈɡɔtˌloːs/, haarlos /ˈhaːʁˌloːs/, inhaltslos /ˈɪnhalt͡sˌloːs/, konkurrenzlos /kɔŋkʊˈʁɛnt͡sˌloːs/, -- kontrolllos /kɔnˈtʁɔlˌloːs/, kopflos /ˈkɔp͡fˌloːs/, leidenschaftslos /ˈlaɪ̯dənʃaft͡sˌloːs/, -- merkmallos /ˈmɛʁkmaːlˌloːs/, papierlos /paˈpiːʁˌloːs/, planlos /ˈplaːnˌloːs/, reglos -- /ˈʁeːkˌloːs/, schnurlos /ˈʃnuːʁˌloːs/, stillos /ˈʃtiːlˌloːs/ or /ˈstiːlˌloːs/, systemlos -- /zʏsˈteːmˌloːs/, tonlos /ˈtoːnˌloːs/, verständnislos /fɛʁˈʃtɛntnɪsˌloːs/ (would be same if regular	-- prefix), zahnlos /ˈt͡saːnˌloːs/, zwanglos /ˈt͡svaŋˌloːs/ {"los", "--los", pos = "a"}, -- Not necessary; we split off -reich in a first pass, and then -nis will be recognized. -- {"nisreich", ">nis-reich", restriction = C .. "$", pos = "a"}, -- Examples: anregungsreich /ˈanʁeːɡʊŋsˌʁaɪ̯ç/, einfallsreich /ˈaɪ̯nfalsˌʁaɪ̯ç/, einwohnerreich -- /ˈaɪ̯nvoːnɐˌʁaɪ̯ç/, geistreich /ˈɡaɪ̯stˌʁaɪ̯ç/, glorreich /ˈɡloːʁˌʁaɪ̯ç/, kohlenhydratreich -- /ˈkoːlənhydʁaːtˌʁaɪ̯ç/, nährstoffreich /ˈnɛːʁʃtɔfˌʁaɪ̯ç/, niederschlagsreich /ˈniːdɐʃlaːksˌʁaɪ̯ç/, -- ruhmreich /ˈʁuːmˌʁaɪ̯ç/, schneereich /ˈʃneːˌʁaɪ̯ç/, siegreich /ˈziːkˌʁaɪ̯ç/, tugendreich -- /ˈtuːɡəntˌʁaɪ̯ç/ (would be same if regular suffix), verlustreich /fɛʁˈlʊstˌʁaɪ̯ç/ {"reich", "--reich", pos = "a"}, -- Not necessary; we split off -voll in a first pass, and then -nis will be recognized. -- {"nisvoll", "nisfòll", restriction = C .. "$", pos = "a"}, -- Examples: eindrucksvoll /ˈaɪ̯ndʁʊksˌfɔl/, gefühlvoll /ɡəˈfyːlˌfɔl/, geheimnisvoll /ɡəˈhaɪ̯mnɪsˌfɔl/ -- (would be same if regular suffix), geräuschvoll /ɡəˈʁɔɪ̯ʃˌfɔl/, gramvoll /ˈɡʁaːmˌfɔl/ (NOTE: enwikt has	-- no secondary stress here), humorvoll /huˈmoːʁˌfɔl/, klangvoll /ˈklaŋˌfɔl/, kraftvoll /ˈkʁaftˌfɔl/, -- maßvoll /ˈmaːsˌfɔl/, qualvoll /ˈkvaːlˌfɔl/, randvoll /ˈʁantˌfɔl/, respektvoll /ʁeˈspɛktˌfɔl/, -- rücksichtsvoll /ˈʁʏkzɪçt͡sˌfɔl/, unheilvoll /ˈʊnhaɪ̯lˌfɔl/, unschuldsvoll /ˈʊnʃʊlt͡sˌfɔl/, -- verantwortungsvoll /fɛʁˈʔantvɔʁtʊŋsˌfɔl/, wundervoll /ˈvʊndɐˌfɔl/ (would be same if regular suffix) {"voll", "--foll", pos = "a"}, -- Examples: abschnittweise /ˈapʃnɪtˌvaɪ̯zə/, allerleiweise /ˈalɐlaɪ̯ˌvaɪ̯zə/, ansatzweise /ˈanzat͡sˌvaɪ̯zə/, -- ausnahmsweise /ˈaʊ̯snaːmsˌvaɪ̯zə/, beispielsweise /ˈbaɪ̯ʃpiːlsˌvaɪ̯zə/, esslöffelweise /ˈɛslœfəlˌvaɪ̯zə/, -- fallweise /ˈfalˌvaɪ̯zə/, haufenweise /ˈhaʊ̯fənˌvaɪ̯zə/ (would be same if regular suffix), leihweise -- /ˈlaɪ̯ˌvaɪ̯zə/, probeweise /ˈpʁoːbəˌvaɪ̯zə/ (would be same if regular suffix), quartalsweise -- /kvaʁˈtaːlsˌvaɪ̯zə/, scheibchenweise /ˈʃaɪ̯bçənˌvaɪ̯zə/, stückchenweise /ˈʃtʏkçənˌvaɪ̯zə/, stückweise -- /ˈʃtʏkˌvaɪ̯zə/, teilweise /ˈtaɪ̯lˌvaɪ̯zə/, versuchsweise /fɛʁˈzuːxsˌvaɪ̯zə/, zwangsweise /ˈt͡svaŋsˌvaɪ̯zə/ {"weise", "--weise", pos = "a"}, }

- Recognized regular-stressing suffixes -

--[=[ The following contains suffixes that have regular suffix stress. That is, if stressed with secondary stress, they take secondary stress only when the preceding syllable has no stress and there is no preceding secondary stress, e.g. for -lein in Fingerlein, Schwesterlein or Müllerlein /ˈmʏlɐˌlaɪ̯n/. In most words, this condition doesn't hold, and so e.g. -lein has no stress, e.g. Äuglein /ˈɔɪ̯klaɪ̯n/ or Bäumlein /ˈbɔɪ̯mlaɪ̯n/. This includes secondary stress of the type found in Ecklädlein /ˈɛkˌlɛːtlaɪn/, Hofkirchlein /ˈhoːfˌkɪʁçlaɪ̯n/, Apfelbäumlein /ˈap͡fəlˌbɔɪ̯mlaɪ̯n/. This is contrary to the behavior of compounds-of-compounds like Hubschrauberlandeplatz and Maulwurfshügel described above; by that rule, we'd expect #/ˈap͡fəlbɔɪ̯mˌlaɪ̯n/ or similar. Cf. similarly Abhängigkeit /ˈapˌhɛŋɪçkaɪ̯t/.

The format of each entry is as for component-like suffixes above.

Note that vowel-initial suffixes are joined phonemically directly onto the preceding part, without any sort of morpheme boundary. As a result, vowel-initial suffixes that bear no stress and require no respelling, such as "ung" and "isch", do not need to be given here. ]=]

local suffixes = { {"ant", "ánt", pos = {"n", "a"}}, {"anz", "ánz", pos = "n"}, {"abel", "ábel", pos = "a"}, {"ibel", "íbel", pos = "a"}, -- I considered an exception for -mal but there are many counter-exceptions like normal, minimal, -- dermal, dezimal, prodromal. {"al", "ál", pos = "a"}, {"ierbar", "íerbàr", pos = "a"}, -- Normally following consonant but there a few exceptions like abbaubar, recyclebar, unüberschaubar. -- Cases like isobar will be respelled with an accent and not affected. {"bar", "bàr", pos = "a"}, -- Restrict to not follow a vowel or s (except for -ss) to avoid issues with nominalized infinitives in -chen. -- Words with -chen after a vowel or s need to be respelled with '>chen', as in Frauchen, Wodkachen, -- Häuschen, Bläschen, Füchschen, Gänschen, etc. Occasional gerunds of verbs in -rchen may need -- to be respelled with '+chen' to avoid the preceding vowel being long, as in Schnarchen. {"chen", "çen", pos = "n", restriction = {C_not_s .. "$", "ss$"}}, {"erei", "əréi", pos = "n", restriction = C .. "$"},	{"ei", "éi", pos = "n", restriction = C .. "$"},	{"ent", "ént", pos = {"n", "a"}}, {"enz", "énz", pos = "n"}, -- -haft is down below -schaft {"heit", "hèit", pos = "n"}, {"ie", "íe", restriction = C .. "$", pos = "n"}, -- No restriction to not occur after a/e; kreieren does occur and noun form and verbs in -en after -aier/-eier -- should not occur (instead you get -aiern/-eiern). Occurs with both nouns and verbs. {"ieren", "íeren", pos = "v"}, -- See above. Occurs with adjectives and participles, and also Gefiert. {"iert", "íert", pos = "a"}, -- See above. Occurs with nouns. {"ierung", "íerung", pos = "n"}, -- "isch" not needed here; no respelling needed and vowel-initial {"tionismus", "zionísmus", restriction = "[^s]$", pos = "n"}, {"ismus", "ísmus", pos = "n"}, -- Restrict to not occur after -a or -e as it may form a diphthong (allermeist, verwaist, etc.). Words -- with suffix -ist after -a or -e need respelling, e.g. Judaist, Atheist, Deist, Monotheist. {"ist", "íst", respelling = "[^ae]$", pos = "n"}, {"istisch", "ístisch", respelling = "[^ae]$", pos = "a"}, {"iv", "ív", pos = {"n", "a"}}, -- This entry causes unstressed -iv to still be lengthened. The previous entry won't apply when the main part has -- primary stress, in which case this entry applies. {"iv", "ihv", pos = {"n", "a"}}, {"ierbarkeit", "íerbàhrkèit", pos = "n"}, {"barkeit", "bàhrkèit", pos = "n"}, {"schaftlichkeit", "schàft.lichkèit", pos = "n"}, {"lichkeit", "lichkèit", pos = "n"}, -- FIXME! Allow two replacement specs. -- {"samkeit", {"sahmkèit", "samkèit"}, pos = "n"}, {"samkeit", "sahmkèit", pos = "n"}, {"keit", "kèit", pos = "n"}, -- See comment above about secondary stress. {"lein", "lèin", pos = "n"}, {"barlich", "bàhrlich", pos = "a"}, {"schaftlich", "schàft.lich", pos = "a"}, {"lich", "lich", pos = "a"}, -- Restrict to not be after -l (cf. Drilling, Helling, Marshalling, Schilling, Spilling,	-- Zwilling). Instances after vowels do occur (Dreiling, Edeling, Neuling, Reling); words -- Feeling, Homeschooling need respelling in any case. {"ling", "ling", restriction = "[^l]$", pos = "n"}, -- Included because of words like Ergebnis, Erlebnis, Befugnis, Begräbnis, Betrübnis, -- Gelöbnis, Ödnis, Verlöbnis, Wagnis, etc. Also Zeugnis with /k/ (syllable division 'g.n') -- instead of expected syllable division '.gn'. Only recognized when following a consonant to exclude Anis, -- Denis, Penis, Spiritus lenis, Tunis (although these would be excluded in any case for the	-- pre-suffix part being too short). Tennis needs respelling 'Ten+nis'. {"nis", "nis", restriction = C .. "$", pos = "n"}, {"or", "ohr", pos = "n"}, {"ös", "ö́s", pos = "a"}, -- Two possible pronunciations (long and short). Occurs after a vowel in grausam (also false positives	-- Bisam, Sesam, which will be excluded as the pre-suffix part is too short). -- FIXME! Allow two replacement specs. -- {"sam", {"sahm", "sam"}, pos = "a"}, {"sam", "sahm", pos = "a"}, -- FIXME: Is the secondary stress correct? It occurs in some words in enwikt with an intervening unstressed -- syllable, as in Bauerschaft, Leidenschaft, Liegenschaft, Mutterschaft, Schwägerschaft, -- Täterschaft, Wissenschafterin, Wissenschaftlerin, Witwenschaft, Zeugenschaft, but not -- in many similar words, e.g. Bauernschaft, Eigenschaft, Elternschaft, Errungenschaft, -- Hundertschaft, Komplizenschaft, Leserschaft, Partnerschaft, Priesterschaft, -- Rechenschaft, Richterschaft, Ritterschaft, Schwangerschaft, Vaterschaft, Völkerschaft, -- Wählerschaft, Wanderschaft, Wissenschaft. In dewikt, of the above words, Mutterschaft and -- Schwägerschaft (from the former list) and Elternschaft and Vaterschaft (from the latter list) are -- the only ones with secondary stress on -schaft. {"schaft", "schàft", pos = "n"}, -- Examples: beispielhaft /ˈbaɪ̯ˌʃpiːlhaft/ (FIXME: dewikt and enwikt have no secondary stress but audio sounds	-- more like our rendering), engelhaft /ˈɛŋəlˌhaft/, fabelhaft /ˈfaːbəlˌhaft/, gebresthaft -- /ɡəˈbʁɛsthaft/, glaubhaft /ˈɡlaʊ̯phaft/, habhaft /ˈhaːphaft/, kometenhaft /koˈmeːtənˌhaft/, -- mädchenhaft /ˈmɛːtçənhaft/, namhaft /ˈnaːmhaft/, rechtsfehlerhaft /ˈʁɛçt͡sˌfeːlɐhaft/, -- unstatthaft /ˈʊnˌʃtathaft/, vorbildhaft /ˈfoːʁˌbɪlthaft/ {"haft", "hàft", pos = "a"}, -- Almost all words in -tät are in -ität but a few aren't: Majestät, Fakultät, Pietät, Pubertät, -- Sozietät, Varietät. Unlike most other suffixes, -tät after a consonant does not result in the -- preceding vowel being pronounced close. Cf. Fakultät /fakʊlˈtɛːt/. {"tät", "tä́t", pos = "n"}, -- Unlike most other suffixes, -tion after a consonant does not result in the preceding vowel being pronounced -- close. Cf. Produktion /pʁodʊkˈt͡si̯oːn/, Konvention /kɔnvɛnˈt͡si̯oːn/. {"tion", "zión", restriction = "[^s]$", pos = "n"}, -- This must follow -tion. Most words in -ion besides those in -tion are still end-stressed abstract nouns, e.g.	-- Religion, Version, Union, Vision, Explosion, Aggression, Rebellion, or other words -- with the same stress pattern, e.g. Skorpion, Million, Fermion. There are several in -ion that are -- various types of ions, e.g. Cadiumion, Hydridion, Gegenion, which need respelling with a hyphen, and -- some miscellaneous words that aren't end-stressed, e.g. Amnion, Champion, Camion, Ganglion, -- needing respelling of various sorts. {"ion", "ión", restriction = C .. "$", pos = "n"}, -- "ung" not needed here; no respelling needed and vowel-initial }

-- Convert `text` into a string where Unicode characters >= 0xFFF0 are displayed in hex format. local function reveal_unicode(text) local parts = {} for codepoint in mw.ustring.gcodepoint(text) do		if codepoint >= 0xFFF0 then table.insert(parts, ("\\U+%X"):format(codepoint)) else table.insert(parts, u(codepoint)) end end

return table.concat(parts) end

local function reorder_accents(text) -- The order should be: (1) DOTUNDER (removed early), (2) */INVBREVEBELOW, (3) TILDE, (4) BREVE/CFLEX, (5) MACRON/ː, -- (6) ˈ and ˌ, (7) ACUTE/GRAVE/DOUBLEGRAVE. local function reorder_accent_string(accentstr) local accents = rsplit(accentstr, "") local accent_order = { [DOTUNDER] = 1, ["*"] = 2,			[INVBREVEBELOW] = 2, [TILDE] = 3, [BREVE] = 4, [CFLEX] = 4, [MACRON] = 5, ["ː"] = 5, ["ˈ"] = 6, ["ˌ"] = 6, [ACUTE] = 7, [GRAVE] = 7, [DOUBLEGRAVE] = 7, }		table.sort(accents, function(ac1, ac2)			return accent_order[ac1] < accent_order[ac2]		end) return table.concat(accents) end -- IPA stress marks as given by the user should precede a vowel; make them follow so they become part of the set of -- accents following a vowel. text = rsub(text, "([ˈˌ])(" .. V .. ")", "%2%1")	text = rsub(text, "(" .. accent_c .. "+)", reorder_accent_string) -- Remove duplicate accents. text = rsub_repeatedly(text, "(" .. accent_c .. ")%1", "%1")	return text end

-- Apply canonical Unicode decomposition to text, e.g. è → e + ◌̀. But recompose ä ö ü so we can treat them as single -- vowels, and recompose ç for use in respelling. local function decompose(text) text = mw.ustring.toNFD(text) text = rsub(text, "." .. "[" .. DIA .. CEDILLA .. "]", {		["a" .. DIA] = "ä",		["A" .. DIA] = "Ä",		["o" .. DIA] = "ö",		["O" .. DIA] = "Ö",		["u" .. DIA] = "ü",		["U" .. DIA] = "Ü",		["c" .. CEDILLA] = "ç",		["C" .. CEDILLA] = "Ç",	}) return text end

-- Decompose the text, canonicalize in various ways, lowercase and split into words, returning each word along with -- whether the word was initially capitalized. -- -- The canonicalization does the following: -- * remove leading, trailing and multiple spaces -- * remove exclamation points, question marks and periods at end of sentence (elsewhere they become foot boundaries) -- * convert commas and em/en dashes to foot boundaries -- * handle special uses of capital letters (N = nasalization, O/U/I/Y after consonant = high glide, I/U after vowel = --  near-high glide) -- * convert macrons to long marks -- * reorder accents appropriately local function canonicalize_and_split_words(text) text = decompose(text) text = rsub(text, "%s+", " ") text = rsub(text, "^ ", "") text = rsub(text, " $", "") -- Capital N after a vowel (including after vowel + accent marks + possibly an h) denotes nasalization. text = rsub(text, "(" .. V .. accent_c .. "*)(h?)N", "%1" .. TILDE .. "%2") -- The user can use respelling with macrons but internally we convert to the long mark ː. text = rsub(text, MACRON, "ː") -- Reorder so ACUTE/GRAVE/DOUBLEGRAVE go last. text = reorder_accents(text)

-- convert commas and em/en dashes to IPA foot boundaries text = rsub_repeatedly(text, "%s*[–—]%s*", " | ") -- comma must be followed by a space; otherwise it might denote absolute secondary stress text = rsub_repeatedly(text, "%s*,%s", " | ") -- period, question mark, exclamation point in the middle of a sentence or end of a non-final sentence -> IPA foot -- boundary; there must be a space after the punctuation, as we use ! and ? in component-separation indicators to -- control the production of glottal stops at the beginning of the next word. text = rsub_repeatedly(text, "([^%s])%s*[!?.]%s([^%s])", "%1 | %2") text = rsub(text, "[!?.]$", "") -- eliminate absolute phrase-final punctuation

local result = {} for word in rgsplit(text, " ") do -- Lowercase the word and check if it was capitalized. local lcword = mw.getContentLanguage:lcfirst(word) local is_cap = lcword ~= word -- capital vowel between consonant and vowel represents a glide o̯/u̯/i̯/y̯ word = rsub(word, "(" .. C .. "%.?)([OUIY])(" .. V .. ")", function(c, cap_v, v)			return c .. ulower(cap_v) .. INVBREVEBELOW .. v		end) -- Capital I/U after another vowel represents a near-high glide ɪ̯/ʊ̯; but conversion to these symbols happens -- late, so we need to maintain the I/U when lowercasing the word. word = rsub(word, "(" .. V .. accent_c .. "*)I", "%1" .. TEMP_I) word = rsub(word, "(" .. V .. accent_c .. "*)U", "%1" .. TEMP_U) word = ulower(word) word = rsub(word, TEMP_I, "I") word = rsub(word, TEMP_U, "U") table.insert(result, {word = ulower(word), is_cap = is_cap}) end return result end

-- Given a respelling (typically originating from one of the lists of known prefixes or suffixes), replace ACUTE/GRAVE -- with their autogenerated equivalent characters. The reason for doing this is that in the affix list, we specify the -- respelling with acute/grave accents for convenience, but ultimately when we substitute the affix respelling for the -- original affix spelling we need to mark the accents as autogenerated since the user didn't explicitly specify the -- accents. local function replace_stress_with_auto(respelling) respelling = gsub(respelling, ACUTE, AUTOACUTE) respelling = gsub(respelling, GRAVE, AUTOGRAVE) return respelling end

-- Check a user-specified affix `affix` against a known list of affixes in `affix_specs`. If the affix is found, return -- the corresponding respelling given in `affix_specs`. This respelling is originally given in `affix_specs` with acute -- and grave accents marking primary and secondary stress; decompose and convert the accents to their auto-variants -- (ACUTE -> AUTOACUTE, GRAVE -> AUTOGRAVE), since the user didn't originally specify them. Furthermore, if -- `replace_stress_with_double_grave` is given, convert acute and grave accents to double-grave (tertiary stress), -- so that stressed vowels are still lengthened in open syllables but no visible stress marker is included. -- -- In addition, for affixes in `affix_specs` that have primary stress, if the user-specified affix includes secondary -- stress on the stressed syllable but otherwise matches, return the corresponding respelling with primary stress -- (acute) replaced with secondary stress (grave). Here, we do not convert to AUTOGRAVE because the secondary stress -- was explicitly given by the user. -- -- If no affixes match, return nil. local function check_for_affix_respelling(affix, affix_specs, replace_stress_with_double_grave) for _, spec in ipairs(affix_specs) do		if affix == spec[1] then local respelling = decompose(spec[2]) respelling = replace_stress_with_auto(respelling) if replace_stress_with_double_grave then -- The user didn't request stress, so replace stress marks with double-grave, which preserves length -- in originally stressed syllables (e.g. in über-). respelling = rsub(respelling, stress_c, DOUBLEGRAVE) end return respelling end end

-- Also check for secondary-stress variant. for _, spec in ipairs(affix_specs) do		local respelling = decompose(spec[2]) if rfind(respelling, ACUTE) then local secstressed_affix = spec.secstress or rsub(spec[1], "^(.-" .. V .. accent_c .. "*)", "%1" .. GRAVE) if affix == secstressed_affix then return rsub(respelling, ACUTE, GRAVE) end end end

return nil end

-- Check if `cluster` (a consonant cluster) matches any of the patterns in `patterns`, which are intended to contain -- allowable onsets or offsets. Matching is anchored on both ends. local function check_onset_offset(cluster, patterns) for _, pattern in ipairs(patterns) do if rfind(cluster, "^" .. pattern .. "$") then return true end end return false end

-- Check text against a pattern restriction. If it matches, return true; otherwise, return false. If the restriction -- is a list of patterns, return true if any of them matches. local function meets_restriction(rest, restriction) if restriction == nil then -- no restriction. return true end if type(restriction) == "table" then -- If any of the restrictions pass, the affix is restricted. for _, restrict in ipairs(restriction) do			if rfind(rest, restrict) then return true end end else if rfind(rest, restriction) then return true end end return false end

-- "Demote" grave to double-grave, and maybe acute to grave. local function demote_stress(part, demote_acute) part = rsub(part, "[" .. GRAVE .. AUTOGRAVE .. ORIG_SUFFIX_GRAVE .. "]", DOUBLEGRAVE) if demote_acute then part = rsub(part, AUTOACUTE, AUTOGRAVE) part = rsub(part, ACUTE, GRAVE) end return part end

-- Add `accent_to_add` to the first vowel of `mainpart`. But don't add it to initial 'y' + vowel, as in Yacht, and -- don't add it to a vowel indicated as nonsyllabic, e.g. Pointe respelled 'POẽte' or 'Po̯ẽte'. local function add_stress_to_mainpart(mainpart, accent_to_add) local changed -- First check for nonsyllabic vowel as first vowel, and add to the following vowel. mainpart, changed = rsubb(mainpart, "^(" .. non_V .. "*" .. V .. INVBREVEBELOW .. non_V .. "*" .. V .. accent_c .. "*)",		"%1" .. accent_to_add) if changed then return mainpart end -- Then check for initial y + vowel. mainpart, changed = rsubb(mainpart, "^(y" .. V .. accent_c .. "*)", "%1" .. accent_to_add) if changed then return mainpart end -- Finally add to first vowel (if any). return rsub(mainpart, "^(" .. non_V .. "*" .. V .. accent_c .. "*)", "%1" .. accent_to_add) end

-- Split a word into components and split off any prefixes or suffixes. On entry, components are separated by '-' or -- '--'. Prefixes may be explicitly notated using '<' and suffixes using '>'. A component may be prefixed with '*' to -- indicate it takes primary stress and '**' to indicate secondary stress; otherwise, the first component takes -- primary stress and all others take secondary stress. On exit, (1) stress marks are added to components and affixes -- as appropriate; (2) component boundaries are indicated by ⁀; (3) prefix-prefix and prefix-mainpart boundaries are -- indicated by ⁀; (4) suffix-suffix and mainpart-suffix boundaries are indicated by ‿; (5) word boundaries are -- indicated by ⁀⁀. -- -- The following parameters are used: -- * `word` is the word to split. -- * `pos` is the "part of speech" (currently underused). -- * `affix_type` is "prefix" if the word itself is a prefix (indicated by '-' at the end of the word), "suffix" if --  the word itself is a suffix (indicated by '-' at the beginning of the word), else nil. -- * `depth` is used internally to handle components separated by '-' and '--': at depth 0 (or nil), we split on '--' --  and recursively process each component; at depth 1, we further split on '-' and recursively process each --  component; at depth 2 we do the actual work of handling prefixes and suffixes and assigning stress. -- * `is_compound` is true when this function is called recursively and there is more than one component. In this case, --  original secondary stress is demoted to tertiary (double-grave) stress (which causes vowels in open syllables to --   lengthen, as with primary and secondary stress, but does not display with an explicit stress mark), and original --  primary stress is demoted to secondary stress if the component does not carry primary stress. local function split_word_on_components_and_apply_affixes(word, pos, affix_type, depth, is_compound) depth = depth or 0

-- First check for component-like suffixes. An example is '-los', which, from the point of view of stress, behave -- similarly to a separate component; cf. Ausdruck 'Áusdrùck' but ausdrucklos 'áusdrucklòs' as if it were -- 'ausdruck-los'. Normal suffix behavior would lead to #'áusdrùcklos'. if depth == 0 then local has_double_dash = rfind(word, "%-%-")

local function respell_respelling(respelling) if has_double_dash then respelling = rsub(respelling, "%-%-", "-") end return respelling end

local components = strutils.capturing_split(word, "(%-%-?)") for i, component in ipairs(components) do			if i % 2 == 1 then -- component, not separator local parts = strutils.capturing_split(component, "([<>]+)") local j = #parts while j >= 1 do					if j > 1 and parts[j - 1] == ">>" then -- suffix local respelling = check_for_affix_respelling(parts[j], component_like_suffixes) if respelling then parts[j] = respell_respelling(respelling) parts[j - 1] = "" end else -- FIXME! Handle secondary-stressed suffixes. for _, suffixspec in ipairs(component_like_suffixes) do							local suffix_pattern = suffixspec[1] local rest = rmatch(parts[j], "^(.-)" .. suffix_pattern .. "$") if rest then if not meets_restriction(rest, suffixspec.restriction) then -- restriction not met, don't split here elseif rfind(rest, "%+$") then -- explicit non-boundary here, so don't split here elseif not rfind(rest, V) then -- no vowels, don't split here else -- Use non_V so that we pick up things like explicit syllable divisions, which we -- check for below. local before_cluster, final_cluster = rmatch(rest, "^(.-)(" .. non_V .. "*)$")									if rfind(final_cluster, "%..") then -- syllable division within or before final cluster, don't split here else parts[j] = rest .. respell_respelling(suffixspec[2]) break end end end end end j = j - 2 end components[i] = table.concat(parts) end end word = table.concat(components) end

-- If at depth 0, split on --, recursively process the parts, and combine. Similarly, at depth 1, split on -, -- recursively process the parts, and combine. At depth 2 we do the actual work. if depth == 0 or depth == 1 then local parts = rsplit(word, depth == 0 and "%-%-" or "%-") if #parts == 1 then return split_word_on_components_and_apply_affixes(word, pos, affix_type, depth + 1, is_compound) else -- Figure out which components bear primary stress. We check for * or + before a component, which indicates -- primary stress. If * is used, the component stress "propagates" up to top level. For example, in			-- a compound of the form A-B--C-*D, subcomponent D gets primary stress and component C-D also gets primary -- stress, whereas in a compound of the form A-B--C-+D, subcomponent D gets primary stress but component -- C-D gets secondary stress, with the primary component stress on A-B (meaning that relative primary			-- stress on D ends up as absolute secondary stress, whereas relative primary stress on A ends up as			-- absolute primary stress and relative primary stress on B and C end up as absolute tertiary stress). local stresses = {} -- Did we see explicit stress on a non-initial component? If so, the first component doesn't get primary -- stress. local saw_non_initial_stress = false -- Did we see explicit '*'? If so, propagate it up so the higher-level component also bears stress. local saw_explicit_propagating_stress = false

-- First, pull out * and + indicators. for i, part in ipairs(parts) do				local rest local stress_marker, rest = rmatch(part, "^([*+])(.-)$") if stress_marker then stresses[i] = true parts[i] = rest if i > 1 then saw_non_initial_stress = true end end if stress_marker == "*" then saw_explicit_propagating_stress = true end end

-- Recursively compute the stress. for i, part in ipairs(parts) do				local saw_explicit_stress parts[i], saw_explicit_stress = split_word_on_components_and_apply_affixes(part, pos, affix_type, depth + 1, "is compound") if saw_explicit_stress then -- * was used in a non-initial subcomponent. The component containing this subcomponent also -- gets primary stress. stresses[i] = true if i > 1 then saw_non_initial_stress = true end saw_explicit_propagating_stress = true end end

-- Initial component gets auto-stressed unless another component was explicitly marked as such. if not saw_non_initial_stress then stresses[1] = true end

-- Now "demote" the component stresses. Secondary stress becomes tertiary, and primary stress becomes -- secondary unless the component is marked for primary stress. for i, part in ipairs(parts) do				parts[i] = demote_stress(part, not stresses[i]) end

-- Finally, put the components together. return table.concat(parts, "⁀"), saw_explicit_propagating_stress end end

-- Interaction between stresses in prefix/mainpart/suffix: -- 1. By default, a normally stressed suffix like -ieren takes secondary stress if there is a stressed prefix. -- Cf. ausprobieren /ˈʔaʊ̯spʁoˌbiːʁən/, ausstaffieren /ˈaʊ̯sʃtaˌfiːʁən/, aufaddieren /ˈaʊ̯fʔaˌdiːʁən/, -- aufmarschieren /ˈaʊ̯fmaʁˌʃiːʁən/, anprobieren /ˈanpʁoˌbiːʁən/, anlegieren /ˈanleˌɡiːʁən/, -- überdosieren /ˈyːbɐdoˌziːʁən/, überreagieren /ˈyːbɐʁeaˌɡiːʁən/. -- 2. If the mainpart is given an explicit primary stress, sometimes a prefix gets secondary stress, sometimes not. -- We handle this by assuming a prefix has no stress in this situation. If the prefix needs secondary stress, add -- it explicitly; we will still recognize the prefix provided it is stressable (e.g. vor- but not ver-, which is	-- always unstressed). -- 3. If a suffix is given explicit primary stress, the same rules apply as in (2).

local retparts = {} local parts = strutils.capturing_split(word, "([<>])") -- The type of the preceding prefix. Used to implement a finite state machine to track allowable combinations of -- prefixes. local previous_prefixtype = nil -- Have we seen stressed un- previously? If so, a following stressed suffix loses its stress, while the main part -- gets secondary stress, cf. unausgegoren 'únausgegòren'. local saw_primary_un_stress = false -- Have we seen a stressed prefix previously (user or auto)? If so, the main part gets secondary stress. local saw_primary_prefix_stress = false -- Have we seen two primary stressed prefixes (user or auto), as in überbeanspruchen 'ǘberbeànspruchen'? If so, the -- main part loses its stress. local saw_double_primary_prefix_stress = false -- Have we seen a primary-stressed suffix like -anz or -ieren (user or auto)? If so, the main part loses its stress. local saw_primary_suffix_stress = false -- Have we seen a suffix with explicit user-specified primary stress? If so, prefixes get secondary stress. local saw_user_primary_suffix_stress = false

local function replace_part_with_multiple_parts(new_parts, inspos, separator) -- Replace the original part that the new parts were derived from. parts[inspos] = new_parts[1] local i = 2 for i=2, #new_parts do			inspos = inspos + 1 table.insert(parts, inspos, separator) inspos = inspos + 1 table.insert(parts, inspos, new_parts[i]) end end

local function has_user_specified_primary_stress(part) -- If there are multiple components (separated by - or --), we want to treat explicit user-specified -- absolute secondary stress like primary stress because we only show the component primary stresses. -- The overall word primary stress shows as ˈ and other component primary stresses show as ˌ. return rfind(part, "[" .. ACUTE .. "ˈ]") or is_compound and rfind(part, "ˌ") end

-- FIXME! Handle absolute stresses using ˈ and ˌ. -- FIXME! Clarify why we still need AUTOACUTE and AUTOGRAVE when they are relative.

-- Break off any explicitly-specified prefixes. local from_left = 1 -- FIXME! Recognize <<. while from_left < #parts and parts[from_left + 1] == "<" do		local prefix = parts[from_left] local unstressed_prefix = rsub(prefix, stress_c, "") if unstressed_prefix == "un" or unstressed_prefix == "ur" then previous_prefixtype = "un" elseif rfind(prefix, stress_c) then previous_prefixtype = "stressed" elseif prefix == "zu" then previous_prefixtype = "unstressed-zu" else previous_prefixtype = "unstressed" end if has_user_specified_primary_stress(prefix) then saw_primary_prefix_stress = true if previous_prefixtype == "un" then saw_primary_un_stress = true end end -- The user didn't request stress, so replace stress marks with double-grave, which preserves length -- in originally stressed syllables (e.g. in über-). local respelling = check_for_affix_respelling(prefix, prefixes, "replace stress with double grave") local must_continue = false if respelling then local respelling_parts = rsplit(respelling, "<") if #respelling_parts > 1 then replace_part_with_multiple_parts(respelling_parts, from_left, "<") must_continue = true end end if not must_continue then table.insert(retparts, prefix) from_left = from_left + 2 end end

-- Break off any explicitly-specified suffixes. local insert_position = #retparts + 1 local from_right = #parts -- FIXME! Recognize >>. while from_right > 1 and parts[from_right - 1] == ">" do		if has_user_specified_primary_stress(parts[from_right]) then saw_user_primary_suffix_stress = true saw_primary_suffix_stress = true end -- FIXME! Use check_for_affix_respelling. table.insert(retparts, insert_position, parts[from_right]) from_right = from_right - 2 end if from_left ~= from_right then error("Saw in word: " .. word) end

local mainpart = parts[from_left] local saw_user_primary_mainpart_stress = has_user_specified_primary_stress(mainpart)

-- Split off any remaining suffixes. Do this before splitting prefixes as for some prefixes (e.g. ur-) we need to -- know if there is a stressed suffix. while true do -- If there was a user-specified suffix with explicit stress, don't try to look for more suffixes. if saw_user_primary_suffix_stress then break end local broke_suffix = false -- FIXME! Handle secondary-stressed suffixes. -- FIXME! Handle restrictions on -chen suffix (initial capital unless component-like suffix occurs as in		-- scheibchenweise). for _, suffixspec in ipairs(suffixes) do			local suffix_pattern = suffixspec[1] local rest = rmatch(mainpart, "^(.-)" .. suffix_pattern .. "$") if rest then if not meets_restriction(rest, suffixspec.restriction) then -- restriction not met, don't split here elseif rfind(rest, "%+$") then -- explicit non-boundary here, so don't split here elseif not rfind(rest, V) then -- no vowels, don't split here else local suffix_respell = decompose(suffixspec[2]) if saw_user_primary_mainpart_stress and rfind(suffix_respell, ACUTE) then -- primary-stressed suffix like -iert but main part already has primary stress; don't split here else -- Use non_V so that we pick up things like explicit syllable divisions, which we -- check for below. local before_cluster, final_cluster = rmatch(rest, "^(.-)(" .. non_V .. "*)$")						if rfind(final_cluster, "%..") then -- syllable division within or before final cluster, don't split here else -- break the word in two; next iteration we process the rest, which may need breaking -- again mainpart = rest

-- We may remove the suffix grave entirely later on. For now, convert it to a special -- symbol so we can handle it properly later (in handle_suffix_secondary_stress), after -- splitting on syllables. suffix_respell = gsub(suffix_respell, GRAVE, ORIG_SUFFIX_GRAVE) if rfind(suffix_respell, ACUTE) then saw_primary_suffix_stress = true -- If there is primary prefix stress, we later convert this to AUTOGRAVE. suffix_respell = gsub(suffix_respell, ACUTE, AUTOACUTE) end table.insert(retparts, insert_position, suffix_respell) broke_suffix = true break end end end end end -- There is a surrounding loop with the idea that we could handle multiple suffixes (e.g. inflectional suffixes), but -- it seems unlikely we will do so. break end

-- Split off any remaining prefixes. while true do		local broke_prefix = false for _, prefixspec in ipairs(prefixes) do			local prefix_pattern = prefixspec[1] local prefix_respell = decompose(prefixspec[2]) local stressed_prefix = rfind(prefix_respell, ACUTE) local prefix, rest = rmatch(mainpart, "^(" .. prefix_pattern .. ")(.*)$")			if not prefix and stressed_prefix then -- Also check for secondary-stress variant. local secstressed_prefix = prefixspec.secstress or rsub(prefix_pattern, "^(.-" .. V .. accent_c .. "*)", "%1" .. GRAVE) prefix, rest = rmatch(mainpart, "^(" .. secstressed_prefix .. ")(.*)$")				if prefix then prefix_respell = rsub(prefix_respell, ACUTE, GRAVE) end end if prefix then local prefixtype = prefixspec.prefixtype or stressed_prefix and "stressed" or "unstressed" if prefixspec.not_with_following_primary_stress and (					saw_user_primary_mainpart_stress or saw_primary_suffix_stress) then -- prefix not allowed when mainpart or suffix stress, don't split here elseif not meets_restriction(rest, prefixspec.restriction) then -- restriction not met, don't split here elseif rfind(rest, "^%+") then -- explicit non-boundary here, so don't split here elseif not rfind(rest, V) then -- no vowels, don't split here elseif rfind(rest, "^..?$") then -- only two letters, unlikely to be a word, probably an ending, so don't split -- here else -- Use non_V so that we pick up things like explicit syllable divisions, which will -- prevent the allowed-onset check from succeeding. local initial_cluster, after_cluster = rmatch(rest, "^(" .. non_V .. "*)(.-)$")					if not check_onset_offset(initial_cluster, allowed_onsets) then -- initial cluster isn't a possible onset, don't split here elseif rfind(after_cluster, "^" .. V .. "?$") then -- remainder is a cluster + single vowel, unlikely to be a word so don't split here -- most such words have impermissible onsets, but cf. Beta, Bete, Bede, -- Bethe, Geste, verso, Verve, vorne, Erbe, Erde, ergo, -- Erle, erste, etc. elseif not rfind(prefix, stress_c) and rfind(after_cluster, "^e" .. C_not_h .. "$") then -- remainder is a cluster + e + single consonant after an unstressed prefix, unlikely -- to be a word so don't split here; most such words have impermissible onsets, but cf. -- Bebel, beben, Besen, beten, geben, Geber, gegen, -- gehen, geten, Becher, Gegner, Verschen, erben, erden, -- Erker, erlen, Erpel, erzen, Erster, etc.; a few legitimate -- prefixed words get rejected, e.g. Beleg, Gebet, which need respelling elseif previous_prefixtype and not prefix_previous_allowed_states[prefixtype][previous_prefixtype] then -- disallowed prefixtype transition else -- break the word in two; next iteration we process the rest, which may need breaking -- again mainpart = rest previous_prefixtype = prefixtype prefix_respell = replace_stress_with_auto(prefix_respell) if rfind(prefix_respell, AUTOACUTE) then -- Stressed prefix. If we've seen un- already, the prefix loses its stress (marked with							-- double grave to preserve length on the stressed syllable, in über-); cf. unausgegoren -- 'únausgegòren'. Otherwise if we've seen a stressed prefix, the prefix gets secondary -- stress, cf. überbeanspruchen 'ǘberbeànspruchen'. Otherwise it retains primary -- stress. if saw_primary_un_stress then prefix_respell = gsub(prefix_respell, AUTOACUTE, DOUBLEGRAVE) elseif saw_primary_prefix_stress or saw_user_primary_mainpart_stress or saw_user_primary_suffix_stress then if saw_primary_prefix_stress then -- main part should not get stress, as in überbeanspruchen saw_double_primary_prefix_stress = true end prefix_respell = gsub(prefix_respell, AUTOACUTE, AUTOGRAVE) end saw_primary_prefix_stress = true end -- Split on < (e.g. for auseindander- respelled 'aus<einánder') and insert each part. local prefix_respell_parts = rsplit(prefix_respell, "<") for _, part in ipairs(prefix_respell_parts) do							table.insert(retparts, insert_position, part) insert_position = insert_position + 1 end broke_prefix = true break end end end end if not broke_prefix then break end end

if rfind(mainpart, DOTUNDER) then -- remove DOTUNDER but don't accent mainpart = gsub(mainpart, DOTUNDER, "") elseif saw_user_primary_mainpart_stress or saw_primary_suffix_stress or saw_double_primary_prefix_stress then -- do nothing elseif rfind(mainpart, "^" .. non_V .. "*" .. V .. accent_c .. "*" .. stress_c) then -- first vowel already followed by a stress accent; do nothing elseif saw_primary_prefix_stress and rfind(mainpart, GRAVE) then -- going to add secondary stress but secondary stress already present, e.g. hinausposaunen respelled -- 'hinausposàunen'; do nothing else -- Add primary or secondary stress on the part; primary stress if no primary stress yet, otherwise secondary -- stress. local accent_to_add = saw_primary_prefix_stress and AUTOGRAVE or AUTOACUTE mainpart = add_stress_to_mainpart(mainpart, accent_to_add) end table.insert(retparts, insert_position, mainpart) insert_position = insert_position + 1

-- remove any +, which has served its purpose for i, part in ipairs(retparts) do		retparts[i] = gsub(part, "%+", "") end

-- Put components together. Put ⁀ between two prefixes or between prefix and main part. Suffixes may join directly -- to preceding part; otherwise, put ‿ before a suffix. local wordparts = {} for i, part in ipairs(retparts) do		if i >= insert_position then -- Handling a suffix. Vowel-initial suffixes join directly to the preceding part so that e.g. written 'ig' -- is pronounced as IPA /ɪɡ/ not as /ɪç/. Primary-stressed consonant-initial suffixes (-tät, -tion) also -- join directly to the preceding part so that a preceding vowel-consonant sequence as in Fakultät, -- Konvention does not result in a close vowel, e.g. /kɔnvɛnˈt͡si̯oːn/ not #/kɔnvenˈt͡si̯oːn/ (in other			-- consonant-initial suffixes, the vowel does lengthen and become close, as in möglich /ˈmøːklɪç/). local join_directly = rfind(part, "^" .. V) or rfind(part, AUTOACUTE) or rfind(part, ACUTE) if saw_primary_prefix_stress then -- Primary-stressed suffix gets demoted to secondary stress if there is a primary-stressed prefix, e.g. -- ausprobieren /ˈaʊ̯spʁoˌbiːʁən/. part = gsub(part, AUTOACUTE, AUTOGRAVE) end if join_directly then wordparts[#wordparts] = wordparts[#wordparts] .. part else table.insert(wordparts, "‿" .. part) end else -- handling a prefix or main part table.insert(wordparts, (i > 1 and "⁀" or "") .. part) end end return table.concat(wordparts) end

-- Secondary stress in a suffix disappears if there is a preceding secondary stress or if the immediately preceding -- syllable has primary stress. If there are multiple such secondary stresses, I *think* we need to process them -- left to right because of the way words are built up. E.g. FOObarkeit is really (FOO + -bar) + -keit, so if FOO is -- one syllable, we get Fúbarkèit (-bar loses secondary stress because of the immediately preceding primary stress, -- then -keit keeps it) but if FOO is two syllables, we get Fúdebàrkeit (-bar keeps stress so -keit loses it -- because there is a preceding secondary stress). local function handle_suffix_secondary_stress(word) if not rfind(word, ORIG_SUFFIX_GRAVE) then return word end local parts = strutils.capturing_split(word, "([.⁀‿]+)") local saw_secondary_stress = false local saw_primary_stress_preceding = false

-- There should never be ORIG_SUFFIX_GRAVE in a compound word (i.e. multicomponent word where the components were	-- originally separated by hyphen or double hyphen in respelling), so we should treat user-specified GRAVE as -- secondary stress rather than as primary stress in a non-primary component. local secondary_stress_c = "[" .. grave .. DOUBLEGRAVE .. ORIG_SUFFIX_GRAVE .. "]"	for i, part in ipairs(parts) do		if i % 2 == 1 then if saw_secondary_stress or saw_primary_stress_preceding then parts[i] = rsub(part, ORIG_SUFFIX_GRAVE, "") end saw_secondary_stress = saw_secondary_stress or rfind(part, secondary_stress_c) saw_primary_stress_preceding = rfind(part, acute_c) end end

return table.concat(parts) end

-- This function contains the main rules that transform an orthographic word into "phonemic" IPA. In reality we show -- some things that are arguably allophonic, such as displaying /əʁ/ as /ɐ/ (but otherwise syllable-final /ʁ/ is still -- shown as such); showing word-internal glottal stops; and showing /ŋ/, which in some analyses is treated as -- underlyingly being /nɡ/. -- -- The word on entry has already had the following applied: -- (1) Unicode decomposition. -- (2) Prefix and suffix splitting, with the addition of ⁀ or ‿ at boundaries of words, components, prefixes and --    suffixes; see comment at top of split_word_on_components_and_apply_affixes. -- (3) Addition of default primary and secondary stresses on the appropriate vowels. Autogenerated stresses use the --    symbols AUTOACUTE, AUTOGRAVE and ORIG_SUFFIX_GRAVE, while user-specified stresses use ACUTE and GRAVE, so that --    they can be distinguished (e.g. towards the end we remove autogenerated primary stress in monosyllabic words, --     but leave user-specified primary stress). -- The output of this function is maybe fed into apply_phonetic_rules to get "phonetic", i.e. narrower, IPA. In -- addition, the actual displayed phonemic pronunciation is obtained by removing ⁀ and ‿ symbols as well as the dot (.) -- symbol marking syllable boundaries. -- -- Note that if the input contains multiple words, they are split on spaces and passed separately to this function, -- and then put back together at the end. local function apply_phonemic_rules(word)

-- Utility function for displaying the partially formatted word using error. local function dbg(word) -- Don't display line number of dbg call itself, but of caller. error(reveal_unicode(word), 2) end

- Misc early conversions -

word = rsub(word, "ǝ", "ə") -- "Wrong" schwa (U+01DD) to correct schwa (U+0259) word = rsub(word, MACRON, "ː") -- The user can use respelling with macrons but internally we convert to the long mark ː word = rsub(word, "x", "ks") -- 'sch' and related graphemes; must precede handling of 'ts' word = rsub(word, "tsch", "ʧʧ") word = rsub(word, "dsch", "ʤʤ") word = rsub(word, "sch", "ʃʃ") -- We treat written 'ts' same as 'tz', e.g. aufwärts, Aufenhaltsgenehmigung, Rätsel and -- foreign-derived words such as Botsuana, Fietse, Lotse, Mitsubishi, Hatsa, Tsatsiki, -- Whatsapp. To prevent this, insert a syllable boundary (.), a component boundary (-), a prefix or suffix -- boundary, etc.	word = rsub(word, "t[sz]", "ʦʦ") word = rsub(word, "z", "ʦ") word = rsub(word, "qu", "kv") word = rsub(word, "q", "k") word = rsub(word, "w", "v") word = rsub(word, "dt", "tt")

- Generation of /ŋ/ -

-- Pinguin, Linguistik, konsanguin, Lingua franca, bilingual word = rsub(word, "ngu(" .. V .. ")", "ŋgu%1") -- In -ngr-, -ngl-, -ngV- with a following stress in the same component, the 'g' is usually pronounced and the 'n' -- is usually /ŋ/. Cf. Gangrän, kongruent, Kongruenz, Kaliningrad/Leningrad/Stalingrad, -- Kongress, Mangrove, Sangria, Anglikaner, anglikanisch, Anglistik, Anglizismus, -- Bangladesch, Jonglage/Jongleur/jonglieren, Konglomerat, Angela, Angeliter, -- Angola, Angolaner, evangelisch, Evangelium, fingieren, fungieren, fungibilität, -- Gingivitis, Klingonisch, Kongolese, Kontingenz, Langobarde, Languste, Mangan, -- Mongolisch, rektangulär, restringieren, singapurisch, singulär, Singultus, -- stringent, tangibel, tangieren, triangulieren, Vercingetorix, etc. Cases without stress and -- with the 'g' unpronounced are hungrig, Angler, Spengler, Umzinglung, England, Engländer, -- Englisch, Denglisch, einenglischen, etc. Some of the above optionally have /n/ instead of /ŋ/; these -- need respelling with 'n.gr' or 'n.gl'. --	-- 'ng' followed by 'a' or 'o' also usually has pronounced 'g' and /ŋ/, e.g. Bungalow, Ingolstadt, -- Kongo, Manga, Singapur, Tango, Tonga, Ungar. --	-- Exceptions to both rules will need respelling, e.g. Ingrid 'Inggrid', Ganglion 'Ganggli.on' (one	-- pronunciation; unpronounced 'g' is also possible), Adstringens 'Adstrínggĕns', Anilingus 'Anilínggus', -- Ingrimm] 'Inggrimm', [[Singular 'Singgular' (although if spelled 'Singulàr', the first rule will take effect	-- with the following stress).	word = rsub_repeatedly(word, "ng([lr]?" .. V .. "[^⁀‿]*" .. stress_c .. ")", "ŋ.g%1")	word = rsub(word, "ng([lr]?[ao])", "ŋ.g%1")	word = rsub(word, "ng", "ŋŋ")	-- Cf. Funke, Gedanke, Kranker, Beschränkung, Erkrankung, Verrenkung, gelenkig,	-- stinkig, etc. with no following stress; trinkbar, Frankreich, etc. within a syllable; Bankett,	-- bankrott, Concorde, Delinquent ('qu' respelled 'kv' above, feeding this rule), flankieren,	-- Frankierung, Konklave, Konkordia, konkret, melancholisch (respelled 'melankólisch'),	-- etc. with following stress. Words with initial 'in-' pronounced /ɪn/ e.g. Inkarnation, Inklusion,	-- inkohärent, inkompatibel, inkompetent will need respelling with a syllable divider 'In.karnation',	-- 'In.klusion', 'in.ko.härent', in.kompatibel, in.kompetent. (Similarly ingeniös, Ingredienz	-- respelled 'in.geniös', 'In.grêdienz'.)	word = rsub(word, "nk", "ŋk")

- 'c' other than 'ch' -

-- Italian-derived words: Catenaccio, Stracciatella word = rsub(word, "cci(" .. V .. ")", "ʧʧ%1") -- Italian-derived words: Cappuccino, Fibonaccizahl word = rsub(word, "cc([ei])", "ʧʧ%1") word = rsub(word, "ck", "kk") -- Mostly Romance-origin words: Account, Alpacca, Broccoli, Latte macchiato, Macchie, Mocca, -- Occasion, Occopirmus, Piccolo, Rebecca, staccatoartig, Zucchini, etc. This needs to -- go before kh -> k so that cch -> kk. word = rsub(word, "cc", "kk") word = rsub(word, "c([^h])", "ʦ%1")

- 'y' -

--[=[	In general, we convert 'y' to either 'ü', 'i', 'I' or 'j'. Do this before handling diphthongs because the resulting 'i' may be in the combination 'ai' or 'ei', which needs to be handled as a diphthong.

1. Accented 'y' always becomes 'ü'. 2. Initial 'y' followed by a vowel becomes 'j', as in New York respelled 'Nju Yórk' /njuː ˈjɔrk/, Yacht /jaxt/, Yak /jak/, Yang /jaŋ/, Yannick /ˈjanɪk/, Yen /jɛn/, Yersiniose /jɛʁziˈni̯oːzə/, Yeti /ˈjeːti/, Yobibyte respelled 'Yóbibàit' /ˈjoːbiˌbaɪ̯t/, Yoga /ˈjoːɡa/, Yoni /ˈjoːni/, Yottabyte respelled 'Yóttabàit' /ˈjɔtaˌbaɪ̯t/, Yuppie respelled 'Yuppi' /ˈjʊpi/. 3. Initial 'y' not followed by a vowel becomes 'ü', as in Ylid /yˈliːt/, Yperit /ypəˈʁiːt/, Ypern /ˈyːpɐn/, Ypsilon /ˈʏpsilɔn/, Ytterbium /ʏˈtɛʁbi̯ʊm/, Yttererde /ˈʏtɐˌʔeːɐ̯də/, Yttrium /ˈʏtʁiʊm/. Some words of this form need respelling: Ybbsitz respelled 'Ibbsitz', Ysop respelled 'Isop', Yspertal respelled 'Isper-tal', Yvonne respelled 'Ivónn'. 4. Final 'y' after a consonant becomes 'i', as in Hobby /ˈhɔbi/, Body respelled 'Boddy' /ˈbɔdi/, Stransky /ˈʃtʀanski/, Monopoly /moˈnoːpoli/, Whisky /ˈwɪski/, Sony /ˈzoːni/, etc.	5. Other 'y' after a consonant becomes 'ü', as in symmetrisch /zʏˈmeːtʁɪʃ/, Pyramide /ˌpyʁaˈmiːdə/, Psychologie /psyçoloˈɡiː/, Acryl /aˈkʁyːl/, Aerodynamik /aeʁodyˈnaːmɪk/, Ägypten /ˌɛˈɡʏptn̩/, analytisch /anaˈlyːtɪʃ/, Beryllium /beˈʁʏli̯ʊm/, etc. Also between a consonant and a vowel, e.g. Cyan /t͡syˈaːn/, Kryometer /kʁyoˈmeːtɐ/, Dryade /dʁyˈaːdə/, Eukaryot /ɔɪ̯kaʁyˈoːt/, euryök /ɔɪ̯ʁyˈʔøːk/, Harpyie /haʁˈpyːjə/, Amblyopie /amblyoˈpiː/, Myon /ˈmyːɔn/, Karyatide /kaʁyaˈtiːdə/. Some words of the latter form need respelling, e.g. Myanmar respelled 'Miánmahr' /ˈmi̯anmaːɐ̯/, Libyen respelled 'Libien' /ˈliːbi̯ən/, Magyar respelled 'Madjáhr' /maˈdjaːɐ̯/, Polyamorie respelled 'Poly-amoríe' /ˌpoːliʔamoˈʁiː/, Polyester respelled 'Poliéster' /poˈli̯ɛstɐ/, Prokaryot respelled 'Prokary̯ót' /pʁokaˈʁy̯oːt/, Rallye respelled 'Rally' /ˈʁali/ or 'Relly' /ˈʁɛli/, Canyon respelled 'Kenjen' /ˈkɛnjən/, Babyöl respelled 'Beby-öl' /ˈbeːbiˌʔøːl/, Ichthyologie respelled 'Ichthy̯ologie' /ɪçty̯oloˈɡiː/, etc. More rarely, some words of the former form also need respelling, e.g. Calypso respelled 'Kalípso' /kaˈlɪpso/. 6. In the sequences 'ay', 'oy' not followed by a vowel, or followed by e/i/u and no stress follows, the 'y' becomes 'I', e.g. Bayern /ˈbaɪ̯ɐn/, Hoyerswerda respelled 'Hoyers-*verda' /hɔɪ̯ɐsˈvɛʁda/, Mayer /ˈmaɪ̯ɐ/, Paraguay /ˈpaːʁaɡvaɪ̯/ or /ˈpaʁaɡvaɪ̯/ or /paʁaˈɡu̯aɪ̯/, Uraguayer /ˈuːʁuɡvaɪ̯ɐ/ or /ˈʊʁuɡvaɪ̯ɐ/ or /ˌuʁuˈɡu̯aɪ̯ɐ/, Payerbach respelled 'Payer-bach', Bayreuth /baɪ̯ˈʁɔʏ̯t/, bayrisch /ˈbaɪ̯ʁɪʃ/, Boykott /ˌbɔɪ̯ˈkɔt/, Malaysia /maˈlaɪ̯zi̯a/, Maybach /ˈmaɪbax/. 7. In the sequence 'ey' in the same circumstances as the previous entry, the 'y' becomes 'i', e.g. Meyer /ˈmaɪ̯ɐ/, Leyermann respelled 'Leyer-mann' /ˈlaɪ̯ɐˌman/, Speyer /ˈʃpaɪ̯ɐ/, Leyen /ˈlaɪ̯ən/, obsolete spellings like beyde for beide or dabey for dabei or meyn for mein, Geysir /ɡaɪ̯ˈziːʁ/. 8. Other 'y' after vowels becomes 'j', e.g. Alija /aˈlija/, Ayatollah /ˌajaˈtɔla/, Ayurveda /ajʊʁˈveːda/, Cayenne respelled 'Kayén' /kaˈjɛn/, Chaya respelled 'Tschaya' /ˈt͡ʃaːja/, Cherimoya respelled 'Tscherimoya' /t͡ʃeʁiˈmoːja/, flamboyant respelled 'flãbOayant' /flɑ̃bo̯aˈjant/, Guyana /ɡuˈjaːna/, Französisch-Guayana /[fʁanˌt͡søːzɪʃ ɡuaˈjaːna/, Himalaya /hiˈmaːlaja/ or /himaˈlaːja/, Larmoyanz respelled 'LarmOayanz' /laʁmo̯aˈjant͡s/, loyal /loˈjaːl/, Malayalam /malaˈjaːlam/ or /malajaˈlaːm/, Maya /ˈmaːja/, Mayo /ˈmaːjo/, oktroyieren respelled 'oktrOayieren' /ɔktʁo̯aˈjiːʁən/, Oriya /oˈʁiːja/, Papaya /paˈpaːja/, Toyota /toˈjoːta/. ]=]	word = rsub(word, "y(" .. accent_c .. "*" .. stress_c .. ")", "ü%1") -- #1 above word = rsub(word, "([⁀‿])y(" .. V .. ")", "%1j%2") -- #2 above word = rsub(word, "([⁀‿])y", "%1ü") -- #3 above word = rsub(word, "(" .. C .. "%.?)y([⁀‿])", "%1i%2") -- #4 above word = rsub(word, "(" .. C .. "%.?)y", "%1ü") -- #5 above word = rsub(word, "([ao]" .. accent_c .. "*" .. ")y(" .. non_V .. ")", "%1I%2") -- #6 above word = rsub(word, "([ao]" .. accent_c .. "*" .. ")y([eiu" .. schwalike .. "]" .. non_stress_c .. "*[⁀‿])", "%1I%2") -- #6 above word = rsub(word, "(e" .. accent_c .. "*" .. ")y(" .. non_V .. ")", "%1i%2") -- #7 above word = rsub(word, "(e" .. accent_c .. "*" .. ")y([eiu" .. schwalike .. "]" .. non_stress_c .. "*[⁀‿])", "%1i%2") -- #7 above word = rsub(word, "y", "j") -- #8 above

- Diphthongs -

-- 'äu', 'eu' -> /ɔɪ̯/. /ɪ̯/ is two characters so we use I to represent it and convert later. word = rsub_repeatedly(word, "[äe](" .. accent_c .. "*" .. ")u(" .. non_accent_c .. ")", "ɔ%1I%2") -- 'au' -> /aʊ̯/. /ʊ̯/ is two characters so we use U to represent it and convert later. word = rsub_repeatedly(word, "a(" .. accent_c .. "*" .. ")u(" .. non_accent_c .. ")", "a%1U%2") -- 'ai', 'ei' -> /aɪ̯/. word = rsub_repeatedly(word, "[ae](" .. accent_c .. "*" .. ")i(" .. non_accent_c .. ")", "a%1I%2") -- 'ie' -> /iː/. But be careful not to trigger on nonsyllabic 'i̯', as in Familie. word = rsub_repeatedly(word, "i(" .. accent_non_invbrevebelow_c .. "*" .. ")e(" .. non_accent_c .. ")", "iː%1%2") -- Doubled vowels as in Haar, Boot, Schnee, dööfer, etc. become long. This should follow -ie- -- handling to get the right output for 'knieen' (respelling or old spelling of knien). word = rsub_repeatedly(word, "([aeoäöü])(" .. accent_c .. "*" .. ")%1(" .. non_accent_c .. ")", "%1ː%2%3")

- Consonant clusters with 'h' -

-- Not ff. Compare Graph with long /a/, Apostrophe with long second /o/. word = rsub(word, "ph", "f") -- dh: Buddha, Abu Dhabi, Sindhi; Adhäsion, adhäsiv are exceptions, can be written ad.häsív --    etc.	-- th: Methode, Abendroth, Absinth, Theater, Agathe (note long /a/ here), Akolyth (note long	--      /y/ here), Algorithmus, katholisch, etc. -- kh: Dzongkha, khaki. Fed by 'cch' above. -- gh: Afghane, Afghanistan, Balogh, Edinburgh, Ghana, Ghetto, Ghostwriter, Joghurt, --    maghrebinisch, Sorghum, Spaghetti. -- bh: Bhutan. -- rh: Arrhythmie, Rhythmus, Rhodos, Rhabarber, Rhapsodie, Rheda, Rhein, Rhenium, --    Rhetorik, Rheuma, rhexigen, Rhone, etc. -- wh: Whatsapp, Whirlpool, Whisky, Whist, Whistleblower, Whiteboard, Whiteout, etc.	word = rsub(word, "([dtkgbrw])h", "%1") -- Handling of 'ch'. Must follow diphthong handling so 'äu', 'eu' trigger ich-laut. word = rsub(word, "chs", "ks") -- When implementing ach-laut, make sure to account for 'h' in respelling (which hasn't yet been converted to the	-- length marker 'ː'). word = rsub(word, "([aɔoʊuU][" .. accent .. "h]*)ch", "%1xx") word = rsub(word, "ch", "çç")

- Remaining 'h' -

--[=[	'h' between vowels should be lengthening only if no stress follows and the following vowel isn't a or o, and if followed by i or u, that vowel should not be word-final (cf. Estomihi, Uhu, Bahuvrihi with pronounced	'h' before word-final 'i' or 'u').

Cf. Reha /ˈʁeːha/, Dschihadist /d͡ʒihaːˈdɪst/, Johann /ˈjoːhan/, Lahar /ˈlaːhaʁ/, Mahagoni /ˌmahaˈɡoːni/, Maharadscha /ˌmahaˈʁaːdʒa/ (prescriptive) or /ˌmahaˈʁadʒa/ (more common) or /ˌmahaˈʁatʃa/ (usual), Mohammed /ˈmoː.(h)a.mɛt/, Rehabilitation /ˌʁehabilitaˈt͡si̯oːn/, stenohalin /ʃtenohaˈliːn/, Tomahawk /ˈtɔ.ma.haːk/ or /ˈtɔ.ma.hoːk/, Bethlehem /ˈbeːt.ləˌhɛm/ or /ˈbeːt.ləˌheːm/ (sometimes without	secondary stress), Bohemien /bo.(h)eˈmjɛ̃/, Bohemistik /boheˈmɪstɪk/, Ahorn /ˈʔaːhɔʁn/, Alkoholismus /ˌalkohoˈlɪsmʊs/, Jehova /jeˈhoːva/, Kohorte /koˈhɔʁtə/, Nihonium /niˈhoːni̯ʊm/, Bahuvrihi /bahuˈvʁiːhi/, Marihuana 'Marihu.ána' /maʁihuˈaːna/ (one pronunciation per dewikt), abstrahieren /apstʁaˈhiːʁən/, ahistorisch /ˈahɪsˌtoːʁɪʃ/, Annihilation /anihilaˈt͡si̯oːn/, Antihistaminikum /antihɪstaˈmiːnikʊm/ Mohammedaner /mohameˈdaːnɐ/, nihilistisch /nihiˈlɪstɪʃ/, Prohibition /pʁohibiˈt͡si̯oːn/, Vehikel /veˈhiːkəl/, huhu /ˈhuːhu/, Uhu /ˈuːhu/, Estomihi /ɛstoˈmiːhi/, Tohuwabohu /ˌtoːhuvaˈboːhu/.

Cf. (not pronounced) Abziehung /ˈapt͡siːʊŋ/, Aufblähung /ˈaʊ̯fˌblɛːʊŋ/, Auferstehung /ˈaʊ̯f(ʔ)ɛʁˈʃteːʊŋ/, Bedrohung /bəˈdʁoːʊŋ/, arbeitsfähig /ˈaʁbaɪ̯t͡sˌfɛːɪç/, befähigt /bəˈfɛːɪçt/, Beruhigen /bəˈʁuːɪɡən/, Ehe, /ˈeːə/, viehisch /ˈfiːɪʃ/.

Exception needing respelling with '.' before the 'h': Uhudler 'U.huhdler' /ˈuːhuːdlɐ/.

We temporarily convert 'h' that should be preserved to 'H', then remove remaining 'h' after vowel, then convert 'H' back to 'h'. ]=]	word = rsub_repeatedly(word, "h(" .. V .. "[^⁀‿]*" .. stress_c .. ")", "H%1") word = rsub(word, "h([ao])", "H%1") word = rsub(word, "h([iu][⁀‿])", "H%1") -- Remaining 'h' after a vowel (not including a glide like I or U from dipthongs) indicates vowel length. Make sure -- to put the ː before stress marks. word = rsub(word, "(" .. V_non_glide .. accent_non_invbrevebelow_c .. "-)(" .. stress_c .. "*" .. ")h", "%1ː%2") -- If we ended up with two ː signs (e.g. from superfluous lengthening 'h' as in Vieh, ziehen), remove one -- of them. word = rsub(word, "ːː", "ː") -- Remaining 'h' after a vowel is superfluous, e.g. rauh, leihen, Geweih. word = rsub(word, "(" .. V .. accent_c .. "*" .. ")h", "%1") -- Convert special 'H' symbol (to indicate pronounced /h/ between vowels) back to /h/. word = rsub(word, "H", "h")

- French and English sounds -

word = rsub(word, "([eo])(" .. accent_c .. "*[IU])", function(eo, iu)		-- 'eI' as in 'SpreI' for Spray; 'eU' not in French or English but kept for parallelism		-- 'oI' as in 'KauboI' for Cowboy; 'oU' as in 'HoUmpehdsch' for Homepage		local lower_eo = {["e"] = "ɛ", ["o"] = "ɔ"}		return lower_eo[eo] .. iu	end) word = rsub(word, "e" .. TILDE, "ɛ" .. TILDE) word = rsub(word, "ö" .. TILDE, "œ" .. TILDE)

- 's' -

word = rsub(word, "([⁀‿])s(" .. V .. ")", "%1z%2") word = rsub(word, "([" .. vowel .. "lrmnŋj]" .. accent_c .. "*%.?)s(" .. V .. ")", "%1z%2") -- /bs/ -> [pz] before the stress as in absentieren, Absinth, absolut, absorbieren, absurd, -- observieren, obsessiv, Obsidian, Obsoleszenz, subsumtiv, but -> [ps] after the stress, -- as in Erbse, Kebse, obsen, schubsen. Similarly for /ds/ -> [tz] before the stress as in -- adsorbieren but -> [ts] after the stress as in Landser. Note, we do not count cases like betriebsam, -- beredsam with [pz], [tz] after the stress; -sam is treated as a separate word. In Trübsal, -sal is a	-- suffix and should be treated as if a separate word; hence, lengthened ü. Überbleibsel is given with [ps] -- in dewikt and Pons, consistent with the rule, but with [bz] in enwikt, which may be wrong. Absence is	-- given with word-final stress and [ps] in dewikt, contrary to the rule, but this is an unadapted French word -- with a nasal vowel in it. --	-- For /gs/, bugsieren is an exception with [ks]. Cf. also pumperlgsund with [ks]. All other examples with -- /gs/ in enwikt have a component boundary in the middle. --	-- No apparent examples involving /vs/, /zs/, /(d)ʒs/ that don't involve clear morpheme boundaries. word = rsub(word, "([bd])s(" .. V .. "[^⁀‿]*" .. stress_c .. ")", "%1z%2") word = rsub(word, "⁀s([pt])", "⁀ʃ%1")

- Reduce extraneous geminate consonants -

-- Reduce extraneous geminate consonants (some generated above when handling digraphs etc.). Geminate consonants -- only have an effect directly after a vowel, and not when before a consonant other than l or r. -- FIXME: Probably not necessary since we remove geminates down below after vowel lengthening, and causes problems. -- Once we're sure these aren't necessary, remove them. -- word = rsub(word, "(" .. non_V .. ")(" .. C .. ")%2", "%1%2")	-- word = rsub(word, "(" .. C .. ")%1(" .. C_not_lr .. ")", "%1%2")

- 'i' and 'u' in hiatus -

-- 'i' and 'u' in hiatus should be nonsyllabic by default; add '.' afterwards to prevent this. word = rsub(word, "(" .. C .. "[iu])(" .. V .. ")", "%1" .. INVBREVEBELOW .. "%2")

- Divide into syllables -

-- Existing potentially-relevant hyphenations/pronunciations: abenteuerdurstig -> aben|teu|er|durs|tig, -- Agraffe -> /aˈɡʁafə/ (but Ag|raf|fe); Agrarbiologie -> /aˈɡʁaːʁbioloˌɡiː/ (but Ag|rar|bio|lo|gie); -- Anagramm -> (per dewikt) [anaˈɡʁam] Ana·gramm; Abraham -> /ˈaːbʁaˌha(ː)m/; Fabrik -> (per dewikt) -- [faˈbʁiːk] Fa·b·rik; administrativ -> /atminɪstʁaˈtiːf/; adjazent -> (per dewikt) [ˌatjaˈt͡sɛnt] -- ad·ja·zent; Adjektiv -> /ˈa.djɛkˌtiːf/ or /ˈat.jɛk-/, but only [ˈatjɛktiːf] per dewikt; Adlatus -> -- /ˌatˈlaːtʊs/ or /ˌaˈdlaːtʊs/ (same in dewikt); Adler -> /ˈaːdlɐ/; adrett -> /aˈdʁɛt/; Adstrat -> -- [atˈstʁaːt]; asthenisch -> /asˈteːnɪʃ/; Asthenosphäre -> /astenoˈsfɛːʁə/; Asthma -> [ˈast.ma]; -- Astronaut -> /ˌas.tʁoˈnaʊ̯t/; asturisch -> /asˈtuːʁɪʃ/; synchron -> /zʏnˈkʁoːn/; Syndrom -> -- [zʏnˈdʁoːm]; System -> /zɪsˈteːm/ or /zʏsˈteːm/. -- Change user-specified. into SYLDIV so we don't shuffle it around when dividing into syllables. word = rsub(word, "%.", SYLDIV) -- Divide before the last consonant. We then move the syllable division marker leftwards over clusters that can -- form onsets. word = rsub_repeatedly(word, "(" .. V .. accent_c .. "*" .. C .. "-)(" .. C .. V .. ")", "%1.%2")	-- Cases like Agrobiologie /ˈaːɡʁobioloˌɡiː/ show that Cl or Cr where C is a non-sibilant obstruent should be -- kept together. It's unclear with 'dl', cf. Adlatus /ˌatˈlaːtʊs/ or /ˌaˈdlaːtʊs/; but the numerous words -- like Adler /ˈaːdlɐ/ (reduced from *Adeler) suggest we should keep 'dl' together. For 'tl', we have -- on the one hand Atlas /ˈatlas/ and Detlef [ˈdɛtlɛf] (dewikt) but on the other hand Bethlehem -- /ˈbeːt.ləˌhɛm/. For simplicity we treat 'dl' and 'tl' the same. word = rsub(word, "(" .. obstruent_non_sibilant_c .. ")%.([lr])", ".%1%2") -- Cf. Liquid [liˈkviːt]; Liquida [ˈliːkvida]; Mikwe /miˈkveː/; Taekwondo [tɛˈkvɔndo] (dewikt); -- Uruguayerin [ˈuːʁuɡvaɪ̯əʁɪn] (dewikt). word = rsub(word, "([kg])%.v", ".%1v") -- We need special handling of /ks/ so it stays together on the same side of the syllable divider but triggers -- vowel shortening; cf. Oxid /ɔˈksiːt/, Reflexion /ʁeflɛˈksi̯oːn/. The extra /k/ will be removed below -- when we remove remaining geminates within a component. word = rsub(word, "k%.s", "k.ks") -- Signal [zɪˈɡnaːl] (dewikt); designieren [dezɪˈɡniːʁən] (dewikt); if split 'g.n', we'd expect [k.n]. -- But notice the short 'i' (which we handle by a rule below). Cf. Kognition [ˌkɔɡniˈt͡si̯oːn] (dewikt) and -- Kognat [kɔˈɡnaːt] (dewikt) vs. Prognose [ˌpʁoˈɡnoːzə] (dewikt) with short closed 'o' (secondary stress	-- doesn't lengthen vowel before primary stress). Cf. also orthognath [ɔʁtoˈɡnaːt] (dewikt), prognath -- [pʁoˈɡnaːt] (dewikt). Cf. Agnes [ˈaɡnɛs] (dewikt) but /ˈaː.ɡnəs/ (enwikt). Cf. regnen /ˈʁeː.ɡnən/ -- "prescriptive standard" (enwikt), /ˈʁeːk.nən/ "most common" (enwikt). Similarly Gegner, segnen, -- Regnum. Also leugnen, Leugner with the same /gn/ prescriptive, /kn/ more common; whereas Zeugnis -- always with /kn/ (because -nis is a suffix). word = rsub(word, "g%.n", ".gn") -- Divide two vowels. But don't divide if the first vowel is indicated as non-syllabic (Familie, Ichthyologie,	-- etc.), or we will wrongly get e.g. 'Fa.mi.li.e' instead of 'Fa.mi.lie'. Also don't divide if the second vowel is a	-- glide (I or U), or we will wrongly get e.g. 'me.in' instead of 'mein' for mein. Do divide if the first vowel is -- a glide, in words like Bayern, which should be divided 'Bay.ern'. word = rsub_repeatedly(word, "(" .. V .. accent_non_invbrevebelow_c .. "*)(" .. V_non_glide .. ")", "%1.%2")	-- User-specified syllable divider should now be treated like regular one. word = rsub(word, SYLDIV, ".")

- Convert some unstressed 'e' to schwa -

-- It seems the following holds (excluding be- and ge- from -- consideration): -- 1. 'e' is less likely to be schwa before the stress than after. -- 2. Before the stress, 'e' in a closed syllable is never a schwa. -- 2. Initial unstressed 'e' as in Elektrizität is never a schwa. -- 3. Before the stress, 'e' in the first syllable is not normally a schwa, cf. Negativität, Benefaktiv. -- 4. Before the stress, 'e' in hiatus is not a schwa, cf. Idealisierung. -- 5. In open non-initial syllables before the stress, it *MAY* be a schwa, esp. before 'r', e.g.	--   Temperatur with /ə/; Generalität with /ə/; Souveränität with /ə/; Heterogenität with /ə/ before 'r' but /e/ later on; --   same for Kanzerogenität; but Immaterialität with /e/; Benediktiner optionally with /ə/ in second --   syllable; Pietät with /ə/; Sozietät with /ə/ but Varietät with /e/, abalienieren with /e/; --   Extremität with /e/; Illegalität with /e/; Integrität with /e/; Abbreviation with /e/; --   acetylieren with /e/; akkreditieren with /e/; ameliorieren with /e/; anästhesieren with /e/; --   degenerieren with /e/ /e/ /ə/; etc. -- 6. After the stress, 'e' is more likely to be schwa, e.g. zumindestens /t͡suˈmɪndəstəns/. But not always, --   e.g. Latex [ˈlaːtɛks] (dewikt); Index /ɪndɛks/; Alex /ˈalɛks/; Achilles [aˈxɪlɛs] (dewikt); --   Adjektiv /ˈa.djɛkˌtiːf/ or /ˈat.jɛk-/; Adstringens [atˈstrɪŋɡɛns]; Adverb /ˈat.vɛʁp/ --   (or /atˈvɛʁp/); Agens /ˈaːɡɛns/; Ahmed /ˈaxmɛt/; Bizeps /ˈbiːt͡sɛps/; Borretsch /ˈbɔʁɛt͡ʃ/; --   Bregenz /ˈbʁeːɡɛnt͡s/; Clemens [ˈkleːmɛns]; Comeback /ˈkambɛk/; Daniel /ˈdaːni̯ɛl/; --   Dezibel /ˈdeːt͡sibɛl/; Diabetes /diaˈbeːtəs/ or /-tɛs/; Dolmetscher /ˈdɔlmɛtʃər/; --   Dubstep /ˈdapstɛp/; etc. -- 7. Given this analysis, we do the following: --   a. If before the stress, 'e' -> schwa only in an internal open syllable preceding 'r'. --   b. If after the stress, 'e' -> schwa only before certain clusters: [lmrn]s? or [lr]ns? or rlns? or s?t? or --      nds?. --	-- Implement (7a) above.

word = rsub_repeatedly(word, "(" .. V .. "[^⁀‿]*)e(%.r[^⁀‿]*" .. stress_c .. ")", "%1ə%2") -- Implement (7b) above. -- First, make sure we don't convert any 'e' to schwa that is followed by an accent or length mark. word = rsub(word, "e(" .. accent_c .. ")", "E%1") -- Now convert 'e' to schwa when followed by the right clusters. Exclude 'e' from the capture directly following -- the 'e' that will be converted, so we work right-to-left and handle cases like Indexen. word = rsub_repeatedly(word, "e([^⁀" .. stress .. "e]*⁀)", function(rest)		local rest_no_syldiv = gsub(rest, "%.", "")		local cl = rmatch(rest_no_syldiv, "^(" .. C .. "*)")		if rfind(cl, "^[lmrn]s?$") or rfind(cl, "^[lr]ns?$") or rfind(cl, "^rlns?$") or rfind(cl, "^s?t?$") or			rfind(cl, "^nds?$") then			return "ə" .. rest		else			return "e" .. rest		end	end) -- Now put back the 'e' that we protected from being converted. word = rsub(word, "E", "e")

- Mark some vowels as short -

-- Handle vowel quality/length changes in open vs. closed syllables, part a: Some vowels that would be -- expected to be long (particularly before a single consonant word-finally) are actually short. -- Unstressed final 'i' before a single consonant is short: especially in -in, -is, -ig, but also -ik as in	-- Organik, Linguistik, etc.; -im as in Interim, Isegrim, Joachim, Muslim (one	-- pronunciation), privatim, Achim (one pronunciation), etc. Those in -il are usually end-stressed -- with long 'i', e.g. Automobil, fragil, Fossil, grazil, hellenophil, stabil, imbezil, -- mobil, fertil, etc. Those in -it are usually end-stressed and refer to minerals, where the 'i' can be -- long or short. Those in -id are usually end-stressed and refer to chemicals, with long 'i'; but cf. David -- /ˈdaːvɪt/; Ingrid [ˈɪŋɡʁɪt] or [ˈɪŋɡʁiːt] (dewikt). word = rsub(word, "i(" .. C .. "[⁀‿])", "i" .. BREVE .. "%1") -- 'i' before 'g' is short including across syllable boundaries without a following stress (Entschuldigung,	-- verständigen, Königin, ängstigend; ewiglich with voiced [g]). word = rsub_repeatedly(word, "i(%.?g[^⁀" .. stress .. "]*⁀)", "i" .. BREVE .. "%1") -- 'i' before 'gn' is short including across syllable boundaries, even with a following stress (Signal,	-- designieren, indigniert, Lignin with voiced [g]). Not commonly before gl + stress, e.g. -- Diglossie, Triglyph, Epiglottis or gr + stress, e.g. Digraph, Emigrant, Epigramm, -- filigran, Kalligraphie, Migräne, Milligramm. ewiglich, königlich etc. with short 'i' -- handled by previous entry. word = rsub(word, "i(%.?gn)", "i" .. BREVE .. "%1") -- Unstressed final '-us', '-um' normally short, e.g. Kaktus, Museum. word = rsub(word, "u([ms][⁀‿])", "u" .. BREVE .. "%1") -- Unstressed final '-on' normally short, e.g. Aaron, Abaton, Natron, Analogon, Myon, Anton -- (either long or short 'o'), Argon, Axon, Bariton, Biathlon, Bison, etc. Same with unstressed -- '-os', e.g. Albatros, Amos, Amphiprostylos, Barbados, Chaos, Epos, Gyros, Heros, -- Kokos, Kolchos, Kosmos, etc. word = rsub(word, "o([ns][⁀‿])", "o" .. BREVE .. "%1")

- Handle vowel quality/length changes -

-- Handle vowel quality/length changes in open vs. closed syllables, part b: Lengthen/shorten as appropriate. --	-- Vowel with secondary stress in open syllable before primary stress later in the same component takes close -- quality without lengthening. word = rsub_repeatedly(word, "(" .. V_unmarked_for_quality .. ")([" .. grave .. DOUBLEGRAVE .. "][.‿][^⁀]*" .. acute_c .. ")", "%1" .. CFLEX .. "%2") -- Any nasal vowel with secondary stress before primary stress later in the same component does not lengthen. -- Cf. Rendezvous. We signal that by inserting a circumflex before the tilde, which normally comes directly -- after the vowel. word = rsub_repeatedly(word, "(" .. V .. ")" .. TILDE .. "([" .. grave .. DOUBLEGRAVE .. "][^⁀]*" .. acute_c .. ")", "%1" .. CFLEX .. TILDE .. "%2") -- Vowel with tertiary stress in open syllable before secondary stress later in the same component takes close -- quality without lengthening if component has no primary stress. word = rsub_repeatedly(word, "(⁀[^⁀" .. acute .. "]*" .. V_unmarked_for_quality .. ")(" .. DOUBLEGRAVE .. "[.‿][^⁀" .. acute .. "]*" .. grave_c .. "[^⁀" .. acute .. "]*⁀)",		"%1" .. CFLEX .. "%2") -- Any nasal vowel with tertiary stress before secondary stress later in the same component takes does not lengthen -- if component has no primary stress. See above change for Rendezvous. word = rsub_repeatedly(word, "(⁀[^⁀" .. acute .. "]*" .. V .. ")" .. TILDE .. "(" .. DOUBLEGRAVE .. "[^⁀" .. acute .. "]*" .. grave_c .. "[^⁀" .. acute .. "]*⁀)",		"%1" .. CFLEX .. TILDE .. "%2") -- Remaining stressed vowel in open syllable lengthens. word = rsub(word, "(" .. V_unmarked_for_quality .. ")(" .. stress_c .. "[.⁀‿])", "%1ː%2") -- Same when followed by a single consonant word-finally or before a suffix. word = rsub(word, "(" .. V_unmarked_for_quality .. ")(" .. stress_c .. C .. "[⁀‿])", "%1ː%2") -- Remaining stressed nasal vowel lengthens. word = rsub(word, "(" .. V .. TILDE .. ")(" .. stress_c .. ")", "%1ː%2") -- Now remove CFLEX before TILDE, which was inserted to prevent lengthening. word = rsub(word, CFLEX .. TILDE, TILDE) -- Unstressed vowel in open syllable takes close quality without lengthening. word = rsub(word, "(" .. V_unmarked_for_quality .. ")(" .. "[.⁀‿])", "%1" .. CFLEX .. "%2") -- Same when followed by a single consonant word-finally or before a suffix. word = rsub(word, "(" .. V_unmarked_for_quality .. ")(" .. C .. "[⁀‿])", "%1" .. CFLEX .. "%2") -- Remaining vowel followed by a consonant becomes short in quantity and open in quality. word = rsub(word, "(" .. V_unmarked_for_quality .. ")(" .. stress_c .. "?" .. C .. ")", "%1" .. BREVE .. "%2") -- Vowel explicitly marked long gets close quality. word = rsub(word, "(" .. V_unmarked_for_quality .. ")ː", "%1" .. CFLEX .. "ː") -- Now change vowel to appropriate quality. word = rsub(word, "(" .. V_unmarked_for_quality .. ")" .. CFLEX, {		["a"] = "a",		["e"] = "e",		["i"] = "i",		["o"] = "o",		["u"] = "u",		-- FIXME, should be split depending on "style"		["ä"] = "ɛ",		["ö"] = "ø",		["ü"] = "y",	}) word = rsub(word, "(" .. V_unmarked_for_quality .. ")" .. BREVE, {		["a"] = "a",		["e"] = "ɛ",		["i"] = "ɪ",		["o"] = "ɔ",		["u"] = "ʊ",		["ä"] = "ɛ",		["ö"] = "œ",		["ü"] = "ʏ",	}) -- Remove * that prevents vowel quality/length changes. word = rsub(word, "(" .. V .. ")%*", "%1")	-- Convert syllable-final -er to /ɐ/. Must precede adding glottal stops between vowels. word = rsub(word, "ər", "ɐ")

- Eliminate remaining geminate consonants -

-- Eliminate remaining geminate consonants within a component (geminates can legitimately exist across a component	-- boundary). These have served their purpose of keeping the preceding vowel short. Normally such geminates will -- always occur across a syllable boundary, but this may not be the case in the presence of user-specified syllable -- boundaries. We need to do this prior to coda devoicing to avoid a word like Buddha ending up as /ˈbʊtda/. local function eliminate_geminates(word) return rsub_repeatedly(word, "(" .. C .. ")([.‿]*)%1", "%2%1")	end word = eliminate_geminates(word)

- Devoice consonants coda-finally -

-- 'ĭg' is pronounced [ɪç] word-finally or before an obstruent (not before an approximant as in ewiglich or	-- Königreich when divided as ewig.lich, König.reich). word = rsub(word, "ɪg⁀", "ɪç⁀") word = rsub(word, "ɪg([.‿]*" .. C_not_lr .. ")", "ɪç%1") -- Devoice consonants coda-finally. There may be more than one such consonant to devoice (cf. Magd), or the -- consonant to devoice may be surrounded by non-voiced or non-devoicing consonants (cf. Herbst). word = rsub_repeatedly(word, "(" .. V .. accent_c .. "*" .. C .. "*)([bdgvzʒʤ])", function(init, voiced)		local voiced_to_voiceless = {			["b"] = "p",			["d"] = "t",			["g"] = "k",			["v"] = "f",			["z"] = "s",			["ʒ"] = "ʃ",			["ʤ"] = "ʧ",		}		return init .. voiced_to_voiceless[voiced]	end) -- Eliminate geminates again so we eliminate the 'd' in words like verwandte. word = eliminate_geminates(word)

- Add glottal stops -

-- Add glottal stop at beginning of component before a vowel. FIXME: Sometimes this should not be present, I think. -- We need symbols to control this. --	-- If previous component ends in a vowel, glottal stop is mandatory, e.g. wiederentdecken. word = rsub(word, "(" .. V .. accent_c .. "*⁀)(" .. V .. ")", "%1ʔ%2") -- Before a stressed vowel at a component boundary, glottal stop is mandatory (except at beginning of word, where	-- it is not shown in the phonemic notation). word = rsub(word, "([^⁀]⁀)(" .. V .. accent_c .. "*" .. stress_c .. ")", "%1ʔ%2") -- Before an unstressed vowel at a component boundary (again, except at beginning of word), glottal stop is -- optional, e.g. Aufenthalt /ˈaʊ̯f.(ʔ)ɛntˌhalt/. word = rsub(word, "([^⁀]⁀)(" .. V .. ")", "%1(ʔ)%2")

- Remove ⁀ and ‿ before non-syllabic components -

-- Remove ⁀ and ‿ (component/suffix boundaries) before non-syllabic components and suffixes. Examples where this -- can occur are final -t/-st inflectional suffixes (-t for third-person singular, second-person plural or past	-- participle; -st for second-person singular or superlative) and -s- interfix between components, which may be -- respelled with hyphens around it (in such a case it should be grouped with the preceding syllable). Don't do -- this at the beginning of a word (which normally shouldn't happen but might in a dialectal word). This should -- precede 'ts' -> 'ʦ' just below. word = rsub_repeatedly(word, "([^⁀‿])([⁀‿])(" .. non_V .. "+[⁀‿])", "%1%3")	-- FIXME: Consider removing ⁀ and ‿ after non-syllabic component/prefix at the beginning of a word in case of	-- dialectal spellings like 'gsund' for gesund; but to handle this properly we need additional rules to -- devoice the 'g' in such circumstances.

- Misc late symbol conversions -

-- -s- frequently occurs as a component by itself (really an interfix), e.g. in Wirtschaftswissenschaft -- respelled 'Wirtschaft-s-wissenschaft' to make it easier to identify suffixes like the -schaft in Wirtschaft. -- Once we remove ⁀ and ‿ before non-syllabic components and suffixes, we get lots of 'ts' that should be rendered -- as t͡s. Handle this now. This should also apply in 'd-s-' e.g. Abschiedsbrief respelled 'Abschied-s-brief' -- /ˈapʃiːt͡sˌbʁiːf/, as coda 'd' gets devoiced to /t/ above. word = rsub(word, "t(%.?)s", "%1ʦ")

-- Misc symbol conversions. word = rsub(word, ".", {		["I"] = "ɪ̯",		["U"] = "ʊ̯",		["ʧ"] = "t͡ʃ",		["ʤ"] = "d͡ʒ",		["ʦ"] = "t͡s",		["g"] = "ɡ", -- map to IPA ɡ		["r"] = "ʁ",		["ß"] = "s",	}) word = rsub(word, "p(%.?)f", "%1p͡f")

- Convert ORIG_SUFFIX_GRAVE to either GRAVE or nothing -

-- This must happen after removing ⁀ and ‿ before non-syllabic components and suffixes because it depends on being -- able to accurately identify syllables. word = handle_suffix_secondary_stress(word)

- Generate IPA stress marks -

-- Remove primary stress from monosyllabic words, but only for autogenerated stress, not if the user explicitly -- added a stress mark. word = rsub(word, "^(⁀⁀[^.⁀‿]*)" .. AUTOACUTE .. "([^.⁀‿]*⁀⁀)$", "%1%2") word = rsub(word, acute_c, "ˈ") word = rsub(word, grave_c, "ˌ") word = rsub(word, DOUBLEGRAVE, "") -- Move IPA stress marks to the beginning of the syllable. word = rsub(word, "([.⁀‿])([^.⁀‿ˈˌ]*)([ˈˌ])", "%1%3%2") -- Suppress syllable mark before IPA stress indicator. word = rsub(word, "%.([ˈˌ])", "%1")

- Convert explicit character notations -

-- Convert explicit character notation to regular character. word = rsub(word, ".", explicit_char_to_phonemic)

return word end

-- This function contains the main rules that transform from phonemic IPA to phonetic IPA. This is called on the output -- of apply_phonemic_rules. -- -- FIXME: This needs work before being ready for release. local function apply_phonetic_rules(word) -- At the beginning of a word, glottal stop is mandatory before a vowel. word = rsub(word, "(⁀⁀)(" .. V .. ")", "%1ʔ%2") -- FIXME: Evaluate whether the following rules should apply across the ‿ symbol. -- -ken, -gen at end of word have syllabic ŋ word = rsub(word, "([kɡ]ə)n⁀", "%1" .. "ŋ⁀") -- IPA ɡ -- schwa + resonant becomes syllabic (written over ŋ, otherwise under) word = rsub(word, "ə([lmn])", "%1" .. SYLLABIC) word = rsub(word, "əŋ", "ŋ̍") -- coda r /ʁ/ becomes a semivowel word = rsub(word, "(" .. V .. "ː?)ʁ", "%1ɐ̯") -- unvoiced stops and affricates become affricated word-finally and before vowel, /ʁ/ or /l/ (including syllabic	-- /l/), but not before syllabic nasal; cf. zurücktreten, with aspirated 'z', 'ck' and first 't' but not second; -- also not before homorganic stop across component boundary like Abbildung [ˈʔap̚.b̥ɪl.dʊŋ] word = rsub(word, "p([.⁀]*b)", "p" .. UNRELEASED .. "%1") word = rsub(word, "t([.⁀]*d)", "t" .. UNRELEASED .. "%1") word = rsub(word, "k([.⁀]*ɡ)", "k" .. UNRELEASED .. "%1") -- IPA ɡ word = rsub(word, "([ptk])(%.?[⁀" .. vowel .. "ʁl])", "%1ʰ%2") word = rsub(word, "(t" .. TIE .. "[sʃ])(%.?[⁀" .. vowel .. "ʁl])", "%1ʰ%2") -- voiced stops/fricatives become unvoiced after unvoiced sound; cf. Abbildung [ˈʔap̚.b̥ɪl.dʊŋ] word = rsub(word, "(" .. unvoiced_C .. "[.⁀]*[bdɡvzʒ])", "%1" .. UNVOICED) -- IPA ɡ -- FIXME: Other possible phonemic/phonetic differences: -- (1) Omit syllable boundaries in phonemic notation? -- (2) Maybe not show some or all glottal stops in phonemic notation? Existing phonemic examples tend to omit it. -- (3) Maybe show -ieren as if written -iern; but this may be colloquial. return word end

local function generate_phonemic_word(word, is_cap) word = rsub(word, "(.)%*", char_to_explicit_char) -- local pos = default_pos(word, pos) local pos = nil local affix_type if word:find("^%-") and word:find("%-$") then affix_type = "interfix" elseif word:find("%-$") then affix_type = "prefix" elseif word:find("^%-") then affix_type = "suffix" end -- FIXME: Handling of prefixes/suffixes is currently broken. word = gsub(word, "^%-?(.-)%-?$", "%1") word = split_word_on_components_and_apply_affixes(word, pos, affix_type) word = "⁀⁀" .. word .. "⁀⁀"	word = apply_phonemic_rules(word) return word end

local function do_phonemic_phonetic(text, pos, is_phonetic) if type(text) == "table" then text = text[1] end local result = {} local words = canonicalize_and_split_words(text) for _, wordspec in ipairs(words) do		local word = generate_phonemic_word(wordspec.word, wordspec.is_cap) if is_phonetic then word = apply_phonetic_rules(word) end table.insert(result, word) end result = table.concat(result, " ") result = rsub(result, "[⁀‿]", "") if not is_phonetic then -- Remove explicit syllable boundaries in phonemic notation. (FIXME: Is this the right thing to do?) result = gsub(result, "%.", "") end -- It doesn't matter so much whether we return composed or decomposed Unicode except for unit tests, where the expected -- values are composed. return mw.ustring.toNFC(result) end

-- External entry point to convert arbitrary text to its phonemic IPA representation. function export.phonemic(text) return do_phonemic_phonetic(text, false) end

-- External entry point to convert arbitrary text to its phonetic IPA representation. function export.phonetic(text) return do_phonemic_phonetic(text, true) end

function export.show(frame) local parent_args = frame:getParent.args local params = { [1] = { required = true, default = "Aufenthalt>s-genehmigung", list = true }, ["pos"] = {}, ["ann"] = {}, }	local args = require("Module:parameters").process(parent_args, params)

local IPA_args = {} for _, arg in ipairs(args[1]) do		local phonemic = export.phonemic(arg, args.pos) local phonetic = export.phonetic(arg, args.pos) table.insert(IPA_args, {pron = '/' .. phonemic .. '/'}) if phonemic ~= phonetic then table.insert(IPA_args, {pron = '[' .. phonetic .. ']'}) end end

local anntext if args.ann then anntext = "'''" .. args.ann .. "''':&#32;"	else anntext = "" end

return anntext .. m_IPA.format_IPA_full { lang = lang, items = IPA_args } end

return export