Template talk:hu-IPA

Hi. I created this template as well as the module Module:hu-pron, for automatic generation of Hungarian pronunciations. The testcases are found at Module:hu-pron/testcases. The template seems OK and I hope it is useful. Please let me know if there are any problems or suggestions, thanks. Wyang (talk) 10:13, 27 October 2014 (UTC)
 * Thank you! This is going to be very helpful. I will test it and let you know if I find anything. --Panda10 (talk) 10:57, 27 October 2014 (UTC)

Below are a few bugs and a list of test cases. I will update them as we go. --Panda10 (talk) 17:54, 27 October 2014 (UTC)

Bugs

 * 1) megegyezik /ˈmɛgɛɟːɛzik/: code returns /ˈmɛgɛɟɛzik/
 * 2) csillagjóslás /ˈt͡ʃilːɒɡjoːʃlaːʃ/: code returns /ˈt͡ʃilːɒɟːoːʃlaːʃ/, gj never turns into ɟ.
 * 3) angyalka /ˈɒɲɟɒlkɒ/: code returns /ˈɒnɟɒlkɒ/
 * 4) barlangkutató /ˈbɒrlɒŋkutɒtoː/: code returns /ˈbɒrlɒŋkːutɒtoː/
 * 5) Spaces: csalódást okoz /ˈt͡ʃɒloːdaːʃtokoz/, code returns /ˈt͡ʃɒloːdaːʃt ˈokoz/

Thanks.
 * 1) Is it a non-orthographic exception derived from egy?
 * 2) 'gj' fixed.
 * 3) More nasal assimilations added (n -> ŋ/ɲ/ɱ/m).
 * 4) Geminate reduction also applied to 'nX:' and 'X:n' sequences.
 * 5) Should spaces always be discarded?
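
For readers following along, the nasal assimilation in point 3 can be sketched as below. This is only an illustration in Python, not the Lua code in Module:hu-pron; the function name `assimilate_n` and the exact trigger table are assumptions based on the usual place-of-articulation pattern (n becomes ŋ before velars, ɲ before palatals, ɱ before labiodentals, m before bilabials).

```python
# Hypothetical sketch; the real module is written in Lua and may differ.
# Maps the sound following an /n/ to the nasal that /n/ assimilates to.
NASAL_BEFORE = {
    "k": "ŋ", "ɡ": "ŋ", "g": "ŋ",  # velars (both IPA ɡ and ASCII g accepted)
    "ɟ": "ɲ", "c": "ɲ",            # palatals (gy, ty)
    "f": "ɱ", "v": "ɱ",            # labiodentals
    "p": "m", "b": "m",            # bilabials
}


def assimilate_n(ipa: str) -> str:
    """Replace each n in an IPA string according to the sound that follows it."""
    out = []
    for i, ch in enumerate(ipa):
        nxt = ipa[i + 1] if i + 1 < len(ipa) else ""
        if ch == "n":
            # Unlisted followers (vowels, end of word, etc.) leave n unchanged.
            out.append(NASAL_BEFORE.get(nxt, "n"))
        else:
            out.append(ch)
    return "".join(out)


print(assimilate_n("ɒnɟɒlkɒ"))  # angyalka, bug 3 above
```

The sketch only covers the nasal step; geminate reduction (point 4) would be a separate pass.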

Please feel free to modify anything in any way. You guys are much more familiar with Hungarian than I am. :) Wyang (talk) 10:35, 28 October 2014 (UTC)
 * Thanks for the updates. I am not familiar with Lua, but I tried to understand the script as much as I could. I've made a few character-related changes. Below are my replies to your questions and I have added more:

Thanks for your help! --Panda10 (talk) 17:09, 28 October 2014 (UTC)
 * 1) Yes, it comes from egy.
 * 2) Thanks for fixing gj.
 * 3) Thanks.
 * 4) This is great.
 * 5) For two-word phrases (plus articles), the space can be discarded and the assimilation rules applied as needed. For longer phrases I don't normally provide IPA. Each word in the phrase is linked to the individual entry where the IPA is provided.
 * 6) New question: I added ccs and cs but I still can't get the correct result for meccs /ˈmɛt͡ʃː/. The script treats the first c as a separate letter and not part of ccs (long cs). Resolved.
 * 7) Can you please change the resulting brackets from [] to //?
 * 8) Could you add a word separator such as an * to indicate that the letters are to be treated separately? E.g.: házsor is a compound word. The zs in the middle is not the letter zs, but z + s and the IPA should be /ˈhaːʃːor/. The template call would be {hu-IPA|ház*sor}.
 * 9) The hiatus-filler is not always added, e.g. algériai /ˈɒlɡeːriʲɒʲi/, the code returns /ˈɒlɡeːriʲɒi/. --Panda10 (talk) 22:24, 31 October 2014 (UTC)

Sorry for the late reply. I think all of the above, as well as the new testcases2, are fixed. Input containing only one space will have the space removed. The word separator introduced is "#", and the two testcase pages now have a third respelling parameter. Please have a look and let me know what you think. (I guess a lot of #-containing combinations will have to be added.) Thank you! Wyang (talk) 08:01, 21 November 2014 (UTC)
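
To illustrate what the "#" separator is for, here is a rough Python sketch of digraph-aware letter splitting. It is not the module's Lua code; the function name `letters` and the digraph list are assumptions made for the example. Without the separator, "zs" in házsor is read as one letter; with "ház#sor" the z and s stay apart.

```python
import re

# Longest alternatives first so that "dzs" wins over "dz" and "zs".
# The digraph inventory here is an assumption for illustration.
LETTER = re.compile("dzs|cs|dz|gy|ly|ny|sz|ty|zs|.")


def letters(respelling: str) -> list[str]:
    """Split a respelling into Hungarian letters, never joining across '#'."""
    out = []
    for part in respelling.split("#"):
        out.extend(LETTER.findall(part))
    return out


print(letters("házsor"))   # zs read as a single letter
print(letters("ház#sor"))  # z and s kept separate
```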

I will add new test cases. Thank you again for all your help! --Panda10 (talk) 20:24, 22 November 2014 (UTC)

No worries. I have modified the code. Please have a look at the two testcase pages. There are 3 fails in testcases2, which I don't quite understand. Thanks! Wyang (talk) 12:24, 23 November 2014 (UTC)

On page Testcase: 'addig üsd a vasat, amíg meleg' and 'basszusgitár' - s becomes /ʒ/ before d and g. I noticed you corrected it to /ʃ/. See the small table in this appendix: Voicing and devoicing of consonants. On page Testcases2: I corrected the IPA for dzsesszzene (my mistake); rossz-szívű is probably an exception, the í can be pronounced short in normal speech. You can leave it as is. There will always be exceptions that the code cannot handle. I am still uncertain about balettjelenet. I hear a light ty /c/ before the j. Leave it as is for now. It is more important to code for the bulk of the words and not for the exceptions. I will continue testing and checking. This is more involved than it appears at first glance. I am amazed how well you understood the rules. Do you speak Hungarian? And as always, thank you! --Panda10 (talk) 14:43, 23 November 2014 (UTC)

Thanks, the Appendix page is very helpful. Does the assimilation of 's' to /ʒ/ occur systematically? There are two examples in testcases2: 'fáklyászene' and 'sertészsír', which might also be affected. With 'tt#j', is it because 'tt#j' > 't:j' > 'cj' not 't#j' > 'tj'? What about 'tengeralattjáró'? I don't speak Hungarian... but I'm interested in languages, especially their evolution histories. Thank you! Wyang (talk) 21:44, 23 November 2014 (UTC)

The assimilation of s to /ʒ/ happens systematically before b, d, g, v, z, zs, dz, dzs, gy. Examples: 'városban' /ˈvaːroʒbɒn/, 'üsd' /ˈyʒd/, 'basszusgitár' /ˈbɒsːuʒɡitaːr/. And yes, 'fáklyás#zene' /ˈfaːkjaːʒzɛnɛ/ and 'sertés#zsír' /ˈʃɛrteːʒːiːr/. I corrected the IPA to tj for 'balett#jelenet', and the same would be for 'tengeralatt#járó'. I will keep working on testing. Thanks. --Panda10 (talk) 23:04, 24 November 2014 (UTC)
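
The rule just stated can be sketched in a few lines of Python. This is an illustration only, not the module's Lua code; the name `voice_sh` is hypothetical, and the sketch assumes the input is already an IPA string in which the letter s has been converted to ʃ.

```python
# Sounds that trigger voicing of a preceding /ʃ/: b, d, g, v, z, zs, gy.
# "d" also covers the affricates dz and dzs, whose IPA starts with d.
VOICED_TRIGGERS = set("bdɡgvzʒɟ")


def voice_sh(ipa: str) -> str:
    """Turn ʃ into ʒ when the next sound is a voiced obstruent."""
    chars = list(ipa)
    for i in range(len(chars) - 1):
        if chars[i] == "ʃ" and chars[i + 1] in VOICED_TRIGGERS:
            chars[i] = "ʒ"
    return "".join(chars)


print(voice_sh("vaːroʃbɒn"))  # városban
```

For 'sertés#zsír' this step alone yields ʒʒ; merging that into the long /ʒː/ would need a further pass that is not shown here.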

Thanks! Wyang (talk) 23:44, 24 November 2014 (UTC)

Could you explain a bit more about /x/? Where does it occur as an allophone of /h/? Thanks Wyang (talk) 01:12, 27 November 2014 (UTC)

This is a can of worms. I am still working on the details to make them presentable for coding. The short answer is that /h/ has four allophones: /h/, /ɦ/, /x/ and /ç/. The /ç/ IPA symbol is used incorrectly according to several experts, but I can't locate the correct one in IPA charts. It's an x with an apostrophe on top, just like on é. I have a question, though. If I understand the code correctly, the current logic loops through each word several times. In each loop, a certain set of characters is replaced if found. For someone who doesn't know Lua, it can take a long time to decipher what happens to a specific letter combination. A less elegant but perhaps more self-descriptive solution would be to list all the critical consonant variations with their appropriate IPA. Since this can be a very long list, I'm not sure how it would impact the speed of processing. But it might be easier to maintain. I'd like to know your thoughts. --Panda10 (talk) 18:22, 28 November 2014 (UTC)

Allophones of h
The /h/ has four allophones: /h/, /ɦ/, /x/, and /?/. The last two are approximately the same sound: /x/ occurs after back vowels, the other after front vowels. The /x/ is formed in the back of the mouth, the other slightly in front of it because of the nature of the vowels. They will be handled identically since the IPA symbol for the last one (an x with an apostrophe) is not available and the use of /ç/ would be incorrect here.

/h/

 * h at the beginning of words: haj, hód, henger
 * h after a consonant (but not after m, n, ny, l, j, r): kezdhet, áthúz
 * h or ch between two identical vowels: aha, uhu, pszichikai

/ɦ/

 * h or ch between two non-identical vowels: tehén, moha, achát
 * h or ch between a sonorant (m, n, ny, l, j, r) and a vowel (in this order): marha, porhó (por + hó, "powdery snow"), monarchia

/x/

 * h or ch before a consonant: Ahmed, drachma, technika, jacht
 * h at the end of the word: Allah, doh, kazah, sah, potroh

/xː/

 * hh or cch between two vowels: Alahhal, sahhal, bacchánsnő (exception: ahhoz, ehhez)
 * ch at the end of the word: pech, almanach
 * ch between two vowels in a suffixed word where the root ends in ch: peches (this might be a tough one to code, it can be handled manually)
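
As a reading aid, the rules for the single letter h listed above can be sketched as a small Python function. This is not the module's Lua code; `h_allophone` is a hypothetical name, and the sketch deliberately ignores ch, hh, and the silent final h, which need separate handling.

```python
VOWELS = set("aáeéiíoóöőuúüű")
SONORANTS = ("m", "n", "ny", "l", "j", "r")


def h_allophone(word: str, i: int) -> str:
    """Allophone of the single letter h at index i of a lowercase word."""
    assert word[i] == "h"
    prev = word[:i]
    nxt = word[i + 1:i + 2]
    if not nxt or nxt not in VOWELS:
        return "x"   # word-final, or before a consonant
    if not prev:
        return "h"   # word-initial
    if prev[-1] in VOWELS:
        # identical vowels on both sides give h, different vowels give ɦ
        return "h" if prev[-1] == nxt else "ɦ"
    if prev.endswith(SONORANTS):
        return "ɦ"   # sonorant + h + vowel
    return "h"       # any other consonant + h + vowel


for w, i in [("haj", 0), ("tehén", 2), ("marha", 3), ("doh", 2)]:
    print(w, h_allophone(w, i))
```

Vowel length counts as a difference here (e versus é in tehén), which matches the examples given for /ɦ/.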

Comments

 * 1) There are a small number of words with a final silent h. The h is pronounced only if the suffix starts with a vowel in inflected forms:
 * céh, cseh, düh, juh, méh, oláh, pléh, rüh
 * 2) The ch is pronounced /t͡sh/ in compound words such as harminchat (harminc + hat, "thirty-six"). These will be handled with the # separator.
 * 3) The ch is also pronounced /t͡sh/ in words ending in c and suffixed with -hoz/-hez/-höz: polchoz, perchez, teknőchöz. These could be included in the program because of the pattern (words ending in choz, chez, chöz).
 * 4) In Hungarian, the ch digraph can only be found in foreign words, so its pronunciation depends on the language of origin and it is not always /x/. Since these cases are unpredictable, they will have to be handled manually. Examples: /k/ charta, /kh/ echó, /ʃ/ Chagall, champagne, /t͡ʃ/ charter, machete.

Examples:
 * haj:
 * hód:
 * henger:
 * kezdhet:
 * áthúz:
 * aha:
 * uhu:
 * pszichikai:


 * tehén:
 * moha:
 * achát:
 * marha:
 * porhó:
 * monarchia:


 * Ahmed:
 * drachma:
 * technika:
 * jacht:
 * Allah:
 * doh:
 * kazah:
 * sah:
 * potroh:


 * Alahhal:
 * sahhal:
 * peches:
 * bacchánsnő:
 * ahhoz:
 * ehhez:
 * pech:
 * almanach:


 * cseh:
 * pléh:
 * harminchat:
 * polchoz:
 * perchez:
 * teknőchöz:

Hi. Thank you very much for the explanations. How do the above examples look? With regard to the idea of exhaustively listing all consonant combinations - I think it is a good idea, and I have partially implemented it. The code now looks for a string of consonants in the word, and tries to find it in the table replace_cons. If it cannot be found, it will replace it using the four steps in replace_set. There might be too many possible combinations - how should complex clusters (like 'lcscs', 'dzscs', 'tszh') be handled logically? Wyang (talk) 00:33, 1 December 2014 (UTC)
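
The lookup order described here (whole cluster in replace_cons first, generic steps otherwise) can be sketched as follows. This is a Python illustration, not the module's Lua code: the table contents, the per-letter fallback standing in for the replace_set steps, and all names are assumptions made for the example.

```python
import re

# Stand-in for replace_cons: whole consonant clusters with a fixed result.
REPLACE_CONS = {"tszh": "t͡sh", "dzscs": "t͡ʃː"}

# Stand-in for the replace_set steps: a simple per-letter conversion.
LETTERS = {"cs": "t͡ʃ", "sz": "s", "zs": "ʒ", "c": "t͡s", "s": "ʃ"}
TOKEN = re.compile("cs|sz|zs|.")

# A cluster is any run of characters that are not vowels.
CLUSTER = re.compile("[^aáeéiíoóöőuúüű]+")


def fallback(cluster: str) -> str:
    """Convert a cluster letter by letter, reading digraphs first."""
    return "".join(LETTERS.get(t, t) for t in TOKEN.findall(cluster))


def convert_cluster(match: re.Match) -> str:
    cluster = match.group()
    if cluster in REPLACE_CONS:  # exact table hit wins
        return REPLACE_CONS[cluster]
    return fallback(cluster)     # otherwise use the generic steps


def convert_clusters(word: str) -> str:
    """Rewrite every consonant cluster of a word; vowels are left as typed."""
    return CLUSTER.sub(convert_cluster, word)


print(convert_clusters("játszhat"))  # tszh comes from the table
print(convert_clusters("kulcs"))     # lcs falls through to the generic steps
```

One consequence of this design is that a table entry only matches the entire cluster, so a cluster that merely contains a listed combination still goes through the generic steps.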

Thank you for updating the module. The above examples look good. The word 'achát' should not need the phon parameter. The ch digraph between two non-identical vowels is /ɦ/. I think I created a conflict in the rules with ch. The allophones of /h/ and its rules were described in detail in a university phonetics text written in Hungarian (Forró Orsolya: Hangtan II, p. 33). The ch digraph was not included in this text but I tried to use the same logic, thinking that this will work in those cases where the ch is pronounced as an allophone of /h/. So to resolve the conflict, I removed the ch between two vowels from the paragraph /xː/. Please check out the changes I made using a diff. Re the consonant clusters: It's great that you added replace_cons. At least I can add new combinations as I test. I like that you also kept the replace_set. The letter combinations you mentioned: lcscs /lt͡ʃ/ kulcscsont (collarbone), dzscs /t͡ʃː/ bridzscsapat (bridge team), tszh /t͡sh/ játszhat (can play). I would also add szsz /sː/, this way 'fodrászszalon' would not need the separator. I will continue testing. --Panda10 (talk) 19:06, 1 December 2014 (UTC)

Thanks - I have modified intervocalic 'ch' accordingly and added 'szsz'. Please let me know what you think. Wyang (talk) 06:44, 2 December 2014 (UTC)
 * It looks good! Thank you. --Panda10 (talk) 22:03, 4 December 2014 (UTC)

I have a question: Does consonant degemination apply across syllable boundaries? e.g. idegklinika, balettjelenet. What are the rules there? Thanks Wyang (talk) 21:29, 3 December 2014 (UTC)
 * There are several rules for three-consonant clusters. For idegklinika (ideg + klinika): a voiced 'g' at the end of the first word is followed by its unvoiced pair 'k' at the beginning of the second word, then 'l' which does not impact the consonant before it. They become [kːl]. In cases like that, the degemination does not happen. So the general rule is that if a voiced-unvoiced (or unvoiced-voiced) pair is connected at the word boundary and they are followed by a third consonant that does not change the second one, the degemination will not occur. Similar words: átdrukkol (át + drukkol) /ˈaːdːrukːol/, megkvarcol (meg + kvarcol) /ˈmɛkːvɒrt͡sol/. Let me know if this does not make sense. Thanks. --Panda10 (talk) 22:03, 4 December 2014 (UTC)


 * átdrukkol:
 * idegklinika:
 * megkvarcol:

Thanks, fixed now. Wyang (talk) 03:54, 5 December 2014 (UTC)
 * Thank you! --Panda10 (talk) 14:30, 5 December 2014 (UTC)

Questions
I wanted to eliminate the need for # in cs+sz combinations (such as in kulcsszerep 'key role'), since it is always pronounced [t͡ʃs], so I added it to replace_cons. But it still does not work without the # separator; it produces [t͡ss]. Do I need to add it to replace_set as well, just like you did with zs+sz? If yes, why? I thought that replace_cons and replace_set handle different cases. Thanks. --Panda10 (talk) 18:30, 5 December 2014 (UTC)
 * Yes, when the script finds a string of consonant letters it will match it in replace_cons first; if that fails it will sequentially replace that string using the steps in replace_set. 'lcssz' is not in replace_cons, and therefore will be handled by the steps. It was my question above too: "There might be too many possible combinations - How should complex clusters (like 'lcscs', 'dzscs', 'tszh') be handled logically"? Wyang (talk) 02:30, 6 December 2014 (UTC)
 * I see, my mistake was adding only cssz and not lcssz to replace_cons. Sorry if I did not answer the 'lcssz', 'dzscs', 'tszh' question more clearly earlier. I think they should be added to replace_cons and I will do that. Yes, there could be a lot of combinations. What if there are 100 or more? Will it impact the performance? Thanks. --Panda10 (talk) 13:55, 6 December 2014 (UTC)
 * Thanks. Size is not a problem, hundreds or thousands should be fine. The hard part might be compiling such a long list. Wyang (talk) 23:12, 6 December 2014 (UTC)
 * Will Lua allow constraints in the replace_cons list? For example, a consonant cluster at the end of the word but nowhere else? Below I added the allophones of j where this would apply. Thank you for your patience with all this. --Panda10 (talk) 17:49, 11 December 2014 (UTC)

Allophones of j
The /j/ has three allophones: /j/, /ç/, and /ʝ/.

/ç/

 * After f, k, p at word ending: döfj /ˈdøfç/, rakj /ˈrɒkç/, kapj /ˈkɒpç/

/ʝ/

 * After b, v, g, r, m at word ending: dobj /ˈdobʝ/, óvj /ˈoːvʝ/, vágj /ˈvaːgʝ/, sarj /ˈʃɒrʝ/, szomj /ˈsomʝ/
 * Becomes /ç/ if followed by an unvoiced consonant due to the rules of assimilation: vágj ki /ˈvaːgçki/
 * Becomes /j/ if it is followed by a vowel: vágj oda /ˈvaːgjodɒ/

/j/

 * In every other case.
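
The word-final j rules above can be sketched in Python as follows. This is only an illustration, not the module's Lua code; `final_j` is a hypothetical name, and the set of voiceless word-initial letters is an assumption. The effect of a following word is applied only to the ʝ group, since that is the only case the rules above describe.

```python
VOWELS = set("aáeéiíoóöőuúüű")
# Assumption: first letters of words that begin with a voiceless consonant
# (s also covers sz, c covers cs, t covers ty).
VOICELESS_INITIALS = set("ptkfsch")


def final_j(prev: str, next_word: str = "") -> str:
    """Allophone of a word-final j.

    prev is the consonant letter before the j; next_word is the following
    word in a two-word phrase, or '' if there is none.
    """
    if prev in ("f", "k", "p"):
        return "ç"
    if prev in ("b", "v", "g", "r", "m"):
        first = next_word[:1].lower()
        if first in VOWELS:
            return "j"   # vágj oda
        if first in VOICELESS_INITIALS:
            return "ç"   # vágj ki
        return "ʝ"       # vágj
    return "j"           # every other case


print(final_j("k"), final_j("g"), final_j("g", "ki"), final_j("g", "oda"))
```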


 * döfj:
 * rakj:
 * kapj:
 * dobj:
 * óvj:
 * vágj:
 * sarj:
 * szomj:
 * vágj ki:
 * vágj oda:


 * :). Wyang (talk) 01:46, 15 December 2014 (UTC)
 * Wow, this is awesome! I'm really hoping that this was the last major change and the rest I can handle just by updating replace_cons. Thank you for all your help and patience in dealing with the intricacies of Hungarian pronunciation! :)) --Panda10 (talk) 16:03, 15 December 2014 (UTC)
 * No worries. I will keep an eye on this discussion and other pages. Wyang (talk) 00:29, 16 December 2014 (UTC)

Multiple pronunciations
Hi. This is not urgent at all, just a request for a functionality that would be nice to have. Occasionally there is a need to display multiple pronunciations in one line. For example, standard vs. dialectal (see utál), or loanwords that still do not have an established pronunciation (see orrspray). Currently, there are two workarounds: 1. Use multiple times, on a separate line for each variant. 2. Use once and separate each variant with a vertical bar. When your time allows, could you please take a look at whether the vertical bar separator functionality could be incorporated? Thank you. --Panda10 (talk) 15:55, 18 December 2014 (UTC)
 * Hi I have added multiple pronunciations inside . Please have a look at the code at orrspray. Are the pronunciations there correct and not for spray? For ones with notes, using the template multiple times with notes in  in front may be better. Thanks. Wyang (talk) 20:57, 18 December 2014 (UTC)
 * Thanks. I corrected orrspray. However, there is a module error on the testcases pages and the IPA symbols disappeared from every entry where the template is used. See épület. --Panda10 (talk) 21:06, 18 December 2014 (UTC)
 * Sorry for the delay - fixed now. Wyang (talk) 04:03, 19 December 2014 (UTC)
 * Thank you for adding the new functionality so quickly and for letting me know about . :) --Panda10 (talk) 14:06, 19 December 2014 (UTC)

dzs

 * 1) Short at the beginning of the word: dzsungel /ˈd͡ʒungɛl/
 * 2) Long at the end of the word: bridzs /ˈbrid͡ʒː/
 * 3) Short or long between two vowels: menedzser /ˈmɛnɛd͡ʒːɛr/, tinédzser /ˈtineːd͡ʒɛr/, büdzsé /ˈbyd͡ʒeː/, hodzsa /ˈhod͡ʒːɒ/, Kambodzsa /ˈkɒmbod͡ʒːɒ/
 * 4) Short between a vowel and a consonant: menedzsment /ˈmɛnɛd͡ʒmɛnt/
 * 5) Short between a consonant and a vowel: lándzsa /ˈlaːnd͡ʒɒ/
 * 6) Compound words 1: farmer|dzseki /ˈfɒrmɛrd͡ʒɛki/
 * 7) Compound words 2: nád|zsalu /ˈnaːdʒɒlu/ and not */ˈnaːd͡ʒːɒlu/

gysz

 * 1) Compound words: gyógy|szer, négy|száz, nagy|szerű

ttc

 * 1) Compound words: balett|cipő, felnőtt|csapat

th

 * 1) fordítható, látható
 * 2) Mikszáth

dh

 * 1) mondhat
 * 2) Compound words: rend|hagyó

hh

 * 1) ahhoz, ehhez, mind|ahhoz
 * 2) dühhöz, Allahhoz, cechet, cechhel
 * 3) Compound words: éh|halál, juh|hús

gys, gysz, gycs
I don't really understand why entries like, , need a # in the template. As far as I know, there is a regular assimilation in these words as per Appendix:Hungarian pronunciation assimilation, but they still need the hashmark. Could you please expand the documentation a bit so that it covers such cases? Thanks --Einstein2 (talk) 14:11, 27 July 2016 (UTC)
 * I made some changes in Module:hu-pron. I wonder why I did not notice this before. :(
 * Thanks for all your work! --Panda10 (talk) 14:43, 27 July 2016 (UTC)
 * Thanks a lot! --Einstein2 (talk) 14:58, 27 July 2016 (UTC)