Module talk:ko-translit/testcases

mekogakopon?
Hmm, I don't know what's happening, but in megaphone the automatic (2nd) transliteration for "메가폰" (megapon) shows "mekogakopon". Is another module used? Testing here: 메가폰, should be "megapon". In Module:ko-translit/testcases it passed OK. --Anatoli (обсудить/вклад) 00:24, 19 September 2013 (UTC)
 * I have removed the module from Module:languages until this is resolved (메가폰 doesn't produce an erroneous transliteration now) but it did when the module was added to Module:languages. --Anatoli (обсудить/вклад) 03:55, 19 September 2013 (UTC)
 * Module:script utilities passes the language and script codes as the second and third arguments to the transliteration function, while this module interpreted its second argument as a syllable break. I moved the syllable break argument to the fourth position.
 * Which leads to another question: currently some of the testcases expect hyphens as syllable break characters in a few places, but this module will not generate them unless explicitly requested to put syllable breaks everywhere. Should these syllable breaks be forced in these cases too, or the testcases changed? Keφr 05:57, 19 September 2013 (UTC)
 * Thank you for fixing. I don't mind if prefix is left optional but the standard way should be without hyphens, IMHO. I don't think anyone will object. Do you mind if I re-add the module to Module:languages and continue testing and adding translations at the same time? I like that you disabled the module when hanja is detected. --Anatoli (обсудить/вклад) 06:06, 19 September 2013 (UTC)
 * No, the module is hardly complete. There are some final consonant jamo which are currently not handled at all, see source code (two final consonants? is it enough to just plaster them together?). I would like to have some testcases with those. And of course with special-case transitions between those and the initial consonants...
 * And Hanja is just passed through as it is. Mixed Hangul-Hanja text will convert only the Hangul part. It may generate some funny output where a text is transliterated into itself. Perhaps I could just return nil when the script is set to "Hani"/"Hans"/"Hant", or when . Keφr 06:21, 19 September 2013 (UTC)
 * Which finals do I need to test, what do you mean by "two final consonants"? --Anatoli (обсудить/вклад) 06:25, 19 September 2013 (UTC)
 * I have thrown in a random fragment of the "Korean Syllables" block. Keφr 06:28, 19 September 2013 (UTC)
 * I see. Do you want me to write some tests with expected results? I'll do more thorough checks later. --Anatoli (обсудить/вклад) 06:41, 19 September 2013 (UTC)
 * Yes, pretty much. Thank you. Keφr 06:43, 19 September 2013 (UTC)
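
The argument mix-up described above can be reproduced with a minimal sketch. Everything here is an assumption made for illustration: the function names, the three-entry lookup table, and the script code "Kore" are hypothetical, and the real module is written in Lua, not Python. The point is only that a language code landing in the syllable-break slot yields exactly the reported output.

```python
# Hypothetical three-syllable lookup, just enough to reproduce the symptom.
SYLLABLES = {"메": "me", "가": "ga", "폰": "pon"}


def tr_before(text, syllable_break="", *_ignored):
    """Assumed old shape: the second positional argument is the break string."""
    return syllable_break.join(SYLLABLES[ch] for ch in text)


def tr_after(text, lang=None, sc=None, syllable_break=""):
    """Assumed new shape: language and script come first, break string fourth."""
    return syllable_break.join(SYLLABLES[ch] for ch in text)


# The caller always passes (text, language code, script code).
print(tr_before("메가폰", "ko", "Kore"))  # mekogakopon
print(tr_after("메가폰", "ko", "Kore"))   # megapon
```

With the break string in fourth position, a caller that wants hyphens can still ask for them explicitly, e.g. `tr_after("메가폰", "ko", "Kore", "-")`.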


 * I've changed my mind about hyphens. Korean Wiktionary uses them and this will make the transliteration much more readable, especially with particle and compound words, easier for learners to see the breakup by syllables. Could you make it default? --Anatoli (обсудить/вклад) 02:27, 20 September 2013 (UTC)
 * Surely. However this seems to go against current practices (few translations or entries here use hyphens), and kind of feels excessive.
 * Also, about the double-consonant finals: why should "값" be "gapt", while "값이" should be "gaps-i"? And why is "값" not "gaps" like the inter-syllable transition table would suggest? (ko:값 gives us "gaps".) And could you find actual words with these syllables, instead of dumping the Unicode "Hangul Syllables" block? Keφr 06:22, 20 September 2013 (UTC)
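
For anyone writing test cases for the double-consonant finals, here is a small sketch of how a precomposed syllable splits into initial, medial and final indices using the standard Unicode arithmetic for the Hangul Syllables block. The function name `decompose` and the `FINALS` list are illustrative and not taken from the module; how each final is then romanized is exactly the open question in this thread, so the sketch stops at identifying the jamo.

```python
S_BASE = 0xAC00              # first precomposed syllable, 가
N_INITIAL, N_MEDIAL, N_FINAL = 19, 21, 28

# Final consonants in Unicode order; index 0 means "no final".
FINALS = ["", "ㄱ", "ㄲ", "ㄳ", "ㄴ", "ㄵ", "ㄶ", "ㄷ", "ㄹ", "ㄺ",
          "ㄻ", "ㄼ", "ㄽ", "ㄾ", "ㄿ", "ㅀ", "ㅁ", "ㅂ", "ㅄ", "ㅅ",
          "ㅆ", "ㅇ", "ㅈ", "ㅊ", "ㅋ", "ㅌ", "ㅍ", "ㅎ"]


def decompose(syllable):
    """Return (initial, medial, final) indices of one precomposed syllable."""
    index = ord(syllable) - S_BASE
    if not 0 <= index < N_INITIAL * N_MEDIAL * N_FINAL:
        raise ValueError("not a precomposed Hangul syllable: %r" % syllable)
    initial, rest = divmod(index, N_MEDIAL * N_FINAL)
    medial, final = divmod(rest, N_FINAL)
    return initial, medial, final


# Show which final jamo each disputed syllable carries.
for ch in "값갌갅갏":
    print(ch, FINALS[decompose(ch)[2]])
```

This makes it easy to enumerate the eleven cluster finals (ㄳ, ㄵ, ㄶ, ㄺ, ㄻ, ㄼ, ㄽ, ㄾ, ㄿ, ㅀ, ㅄ) and check that each one has a test case.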


 * Current practices include capitalisation and hyphens when they are logically appropriate - compound words, particles, attached copulas, possessive endings, etc. All these will disappear with automation and will make the text less readable. I will defend this point, if there are opponents, which doesn't contradict any existing policies. I think it's a good compromise if we can start relying on automated, not manual transliteration.
 * The rule for final clipped consonant is the same, even if they are combined. "값" is "gapt" but "값이" is "gaps-i" because of the following vowel. It's 100% "gaps-i" but I'm still looking for confirmation for "gapt" (the actual pronunciation is "kap"). These combinations are not too common, so I chose Unicode blocks to get to marginal cases. Anyway, "값" is a real word. Korean Wiki is not always right. :) --Anatoli (обсудить/вклад) 06:38, 20 September 2013 (UTC)
 * I have asked two people who know Korean about "값". Also, not sure about cases like ..."말)로" with punctuation between Hangeul syllables. Ideally, it should be "mal)lo" but I don't know if it's hard. --Anatoli (обсудить/вклад) 06:43, 20 September 2013 (UTC)
 * All right, I got confirmation from a knowledgeable Korean for "값". It's "gapt". Please note new cases and questions. In 갏의 it's "r" because "ㅎ" is silent around "ㄹ" --Anatoli (обсудить/вклад) 07:13, 20 September 2013 (UTC)


 * I gotta go, but "갏의" is probably "galh-ui", even if "r" is pronounced, sorry for the confusion. Will double-check later. --Anatoli (обсудить/вклад) 07:37, 20 September 2013 (UTC)
 * Okay. Good night. Keφr 07:39, 20 September 2013 (UTC)
 * No, I'm not asleep yet, but I'll be a bit busy ;) --Anatoli (обсудить/вклад) 09:03, 20 September 2013 (UTC)


 * I have found this page helpful for Korean transliteration: http://www.indiana.edu/~korean/K101/Pron_rules.html#Tensification —Stephen (Talk) 13:40, 20 September 2013 (UTC)


 * Thank you for your participation, Stephen. It's not so much about pronunciation; there are too many transliteration standards for Korean, but we'd like this module to follow Revised Romanization of Korean as closely as possible. I'm currently busy but I'm going to change many of your test cases. --Anatoli (обсудить/вклад) 03:48, 21 September 2013 (UTC)
 * I have corrected user cases according to what I think matches RR. I can't find a more comprehensive description of how to transliterate some of the combinations under the RR standard, so I used some logic and assumptions about what they should be transliterated as. Happy to be dissuaded. --Anatoli (обсудить/вклад) 00:02, 23 September 2013 (UTC)

Some combinations
According to user Russ (Korean Wiktionary) 갌, 값 and 갅 are gals, gaps and ganj. I thought they should be galt, gapt and gant. Do you agree with his suggestions? It is a bit hard because there is no comprehensive guide for RR. --Anatoli (обсудить/вклад) 04:30, 23 September 2013 (UTC)
 * I had the following discussion with one of the most active editors in the Korean Wiktionary here. He has been transliterating Korean entries for years using RR standards. I think we can use his judgement for the above, unless we have some other cases. Any objections? --Anatoli (обсудить/вклад) 05:41, 23 September 2013 (UTC)

Larger test (from Harry Potter)

 * 해리포터의 마법사의 돌 - 상
 * 제 1장 살아남은 아이
 * 프리벳가 4번지에 살고 있는 더즐리 부부는 자신들이 정상적이라는 것을 아주 자랑스럽게 여기는 사람들이었다. 그들은 기이하거나 신비스런 일과는 전혀 무관해 보였다. 아니, 그런 터무니없는 것은 도저히 참아내지 못했다.
 * 더즐리 씨는 그루닝스라는 드릴 제작 회사의 중역이었다. 그는 목이 거의 없을 정도로 살이 뒤룩뒤룩 찐 몸집이 큰 사내로, 코밑에는 커다란 콧수염을 기르고 있었다. 더즐리 부인은 마른 체구의 금발이었고, 목이 보통사람보다 두 배는 길어서, 담 너머로 고개를 쭉 배고 이웃 사람들을 몰래 훔쳐보는 그녀의 취미에는 더없이 제격이었다.
 * 더즐리 부부에게는 두둘리라는 어린 아들이 하나 있었는데, 그들은 세상 어디에도 두들리처럼 착한 아이는 없다고 생각했다.
 * 그런데 부족함이라고는 전혀 없는 더즐리 부부에게는 누구에게도 알리고 싶지 않은 비밀이 하나 있었다. 그건 포터 부부에 관한 것이었는데, 혹시 누구라도 포토 부부에 대해 알아낸다면 더즐리 부부는 아마 도저히 견딜 수 없을 것이다. 포터부인은 더즐리 부인의 동생이었지만, 그들은 몇 년째 서로 만난 적이 없었다.
 * 사실 더즐리 부인은, 자신의 여동생과 그 엉터리 같은 동생남편이 더즐리 집안에 전혀 어울리지 않는 부류라고 생각했기 때문에 동생이 없는 것처럼 행동했다.
 * 더즐리 부부는 포터 부부가 갑자기 이 근처에 나타나면 이웃 사람들이 뭐라고 떠들어댈지 생각만 해도 몸서리가 쳐졌다. 더즐리 부부는 포터 부부에게도 아들이 하나 있다는 것은 알고 있었지만, 본적도 없었다. 이 아이는 더즐리 부부가 포터 부부를 멀리하는 또 다른 이유이기도 했다. 그들은 두들 리가 그런 아이와 어울리지 않길 바랐다.
 * 하늘에 구름이 잔뜩 끼었다고 세상에 금방 기이하고 신비스런 일이 일어나는 것은 아니지만, 더즐리 부부가 잠에서 깨어난 그 우중충하고, 흐린 화요일에 우리의 이야기는 시작된다.
 * 더즐리 씨는 전형적인 직장인 풍의 무미건조한 넥타이를 매고 콧노래를 흥얼거리며 출근 준비를 서둘렀고, 더즐리 부인은 악악 울어대는 두들리를 힘겹게 아기용 의자에 앉히며 신나게 남의 험담을 늘어놓았다.

--Anatoli (обсудить/вклад) 12:56, 23 September 2013 (UTC)


 * Concern: 없다 is transliterated as "eopda", not "eops-da". I think the second consonant in double consonant finals should be dropped when the next syllable starts with a consonant. Have to think about it. --Anatoli (обсудить/вклад) 13:06, 23 September 2013 (UTC)
 * Or maybe transliterate it with a single consonant by default, and add the other back in case the next syllable starts with an ieung, we encounter a word break or the end of the string. Keφr 14:19, 23 September 2013 (UTC)


 * Not a concern any more. User:Russ transliterates it as "eops-da". RR shows all consonants, even if they are not pronounced. The module looks good. Well done! --Anatoli (обсудить/вклад) 22:31, 23 September 2013 (UTC)
 * I am not sure. There is some discrepancy in use of r/l between this module and what ko:User talk:Russ wrote. Is that a non-concern? Keφr 23:09, 23 September 2013 (UTC)
 * No concern. He made a few typos, not just in r/l but a/ae because he typed by hand. r/l rules are straightforward. --Anatoli (обсудить/вклад) 23:22, 23 September 2013 (UTC)

Confusion between transcription and transliteration?
The hyphen must be used if and only if the zero initial (ㅇ) follows another hangeul. 직접 must be transliterated to jigjeob, not to jik-jeop. There are some errors in the expected forms: 있습니다 must be issseubnida, not it-seum-ni-da. Isn’t there a confusion between transcription and transliteration perhaps? See fr:Annexe:Romanisation du coréen, fr:Modèle:ko-roman, and fr:Modèle:ko-translit if you are interested. — T AKASUGI Shinji (talk) 06:13, 24 September 2013 (UTC)


 * Shinji, hyphens are used in the Korean Wiktionary. This is an automatic module, so it won't be able to determine where particles, copulas and compound words are. It will only be used if there is no manual transliteration. I suggest a compromise, as automatic transliteration won't be able to do capitalisation either and will have other limitations. There are too many systems, we should stick to one. I have confirmed most cases with ko:User:Russ @Korean Wiktionary. It is a practical transliteration, with some phonetic features as in Revised Romanization of Korean, which will not necessarily match other methods, for example, the French module uses "hangugmal" (한국말) and "gachi" (같이). It should be "han-gung-mal" and "gat-i" in RR.
 * I agree about it-seum-ni-da, it should be changed to "is-seum-ni-da", though. I'll add it to the test cases, hopefully, Kephir will be able to fix it. --Anatoli (обсудить/вклад) 06:28, 24 September 2013 (UTC)
 * Now I’m sure that you confuse the two. Hangungmal, gachi, and itseumnida are correct transcriptions, and Hangugmal, gat-i, and issseubnida are correct transliterations. Generally we need the former but not the latter. The French templates (not modules) display them correctly. — T AKASUGI Shinji (talk) 08:12, 24 September 2013 (UTC)


 * The Wikipedia page above may help to unconfuse and decide what to call it - transcription or transliteration. There's no perfect transliteration scheme but RR is both official and standard. They used features both phonetic (like Hangungmal) and literal (and gat-i). --Anatoli (обсудить/вклад) 12:17, 24 September 2013 (UTC)
 * Read 로마자 표기법 carefully. Articles 1-7 describe transcription and article 8 describes transliteration. They are officially defined and there is almost no ambiguity. There are also examples. — T AKASUGI Shinji (talk) 15:28, 24 September 2013 (UTC)
 * Shinji, my Korean is poor but anyway, what standard do you propose? Does it have a name? Can you write out your proposal? RR is the latest and is a standard in South Korea; Yale and McCune–Reischauer have their own flaws. The RR standard is feasible and is almost done. If you noticed, many transliteration standards in Wiktionary are not completely phonetic and they don't have to imitate IPA. Besides, no standard can be liked by everybody. It seems lost interest in the project, I don't know. --Anatoli (обсудить/вклад) 05:37, 2 October 2013 (UTC)
 * Yes, RR is the standard. But RR has two ways of Romanization, as shown in the table above. Avoid using the term transliteration because it is not what we are going to use. We should stick to transcription, which is explained officially in the link above. The Romanizations of 한국말 and 같이 are consequently Hangungmal and gachi. — T AKASUGI Shinji (talk) 00:37, 10 October 2013 (UTC)
 * Perhaps you can take over and write test cases if is still interested. I'd need the transliteration rules written out in English to be able to contribute better. Judging by the English Wikipedia page I would suggest "gat-i" (it doesn't describe all cases in great detail, though) despite the pronunciation, as it more closely follows the spelling (knowledge of Korean phonology is sufficient to know how to pronounce Hangeul or its Roman presentation, whatever you want to call it). The rule for changing "t" to "ch" is not mentioned there. Hangungmal can be deduced from the table in the Wikipedia page, I am not arguing that point. Is the Korean page really official? All transliteration modules use transliteration, not transcription, because it's easier and more standard. I don't know why we need to make an exception for Korean. It's quite complicated as it is. --Anatoli (обсудить/вклад) 01:02, 10 October 2013 (UTC)
 * The page is a part of the official website of the National Institute of the Korean Language. And it is clearly written, as an example of palatalization, that 같이 is gachi. Search the word in the page. The English Wikipedia page doesn’t fully cover the rules. — T AKASUGI Shinji (talk) 04:55, 10 October 2013 (UTC)
 * OK, I have changed the test case from "gat-i" to "gach-i" (let's discuss the use of hyphens at a later stage, it's easy to change back and forth). Why is "있습니다" transcribed as "itseumnida", not "issseumnida"? What are other differences? --Anatoli (обсудить/вклад) 05:11, 10 October 2013 (UTC)
 * There are only seven codas: p, t, k, m, n, ng, and l. In 있습니다, ㅆ is neutralized with ㄷ, and romanized to t. On the other hand, in 있어요, ㅆ is an onset and romanized to ss. — T AKASUGI Shinji (talk) 05:28, 10 October 2013 (UTC)
 * My Korean is even poorer, so I would rather wait for you two to sort this out. I guess what Shinji is saying is that there are two methods of converting Hangul into the Latin alphabet, one of them being a simple mapping of jamo into Latin letters ("transliteration"), and the other takes the actual pronunciation into account ("transcription"), and that we should use the latter. (Which I am somewhat sympathetic towards, since the latter is, I presume, harder to learn/achieve but more useful. Learning an alphabet is easy.) And no, our current Cyrillic modules for example actually do "transcription", otherwise Module:ru-translit would not be special-casing "-его" as "-evo", or choose to transcribe "е" as "je" or "e" depending on the preceding letter. Keφr 05:59, 10 October 2013 (UTC)
 * Thanks for finally responding.
 * Yes, Cyrillic is a combination of both but it's very far from transcription: the romanizations are "molokó", "zub" and "čéstnyj", not "malakó", "zup" and "čésnyj" as the words are pronounced. "е" as "je" or "e" is a case where letters have different values in certain positions, same as with Korean finals, which you have already implemented. I always stressed that we use a combination of both methods. And in Russian, we also use manual exceptions (e.g. symbol ɛ), which we probably won't need in Korean. (In Japanese, particles は and へ are transcribed as "wa" and "e", not "ha" and "he", and "おう" (or other letters in -o row) can be both "ou" and "ō". So, there are exceptions in Japanese, where manual transcription is preferred.)
 * If you feel that the transcription method is feasible and Shinji is willing to take part (maybe by checking the Harry Potter passage above), then I'll follow. I just don't have enough material in English to add more cases.
 * I've changed the test case to "it-seum-ni-da", which now passes. --Anatoli (обсудить/вклад) 06:11, 10 October 2013 (UTC)
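
The coda neutralization Shinji describes (only seven possible codas: p, t, k, m, n, ng, l) can be sketched for single final consonants as below. This is a hedged illustration, not the module's code: the names `CODA` and `coda_in_isolation` are made up, the cluster finals are deliberately left out because their treatment is still being discussed here, and assimilation with the following syllable (e.g. the nasalization in 있습니다) is not modelled at all.

```python
# Final consonants in Unicode order; index 0 means "no final".
FINALS = ["", "ㄱ", "ㄲ", "ㄳ", "ㄴ", "ㄵ", "ㄶ", "ㄷ", "ㄹ", "ㄺ",
          "ㄻ", "ㄼ", "ㄽ", "ㄾ", "ㄿ", "ㅀ", "ㅁ", "ㅂ", "ㅄ", "ㅅ",
          "ㅆ", "ㅇ", "ㅈ", "ㅊ", "ㅋ", "ㅌ", "ㅍ", "ㅎ"]

# Single finals mapped onto the seven codas; clusters are intentionally absent.
CODA = {
    "": "",
    "ㄱ": "k", "ㄲ": "k", "ㅋ": "k",
    "ㄴ": "n",
    "ㄷ": "t", "ㅅ": "t", "ㅆ": "t", "ㅈ": "t", "ㅊ": "t", "ㅌ": "t", "ㅎ": "t",
    "ㄹ": "l",
    "ㅁ": "m",
    "ㅂ": "p", "ㅍ": "p",
    "ㅇ": "ng",
}


def coda_in_isolation(syllable):
    """Coda letter(s) for a syllable read on its own, or None for a cluster."""
    index = ord(syllable) - 0xAC00
    if not 0 <= index < 11172:
        raise ValueError("not a precomposed Hangul syllable: %r" % syllable)
    return CODA.get(FINALS[index % 28])


print(coda_in_isolation("있"))  # t
print(coda_in_isolation("강"))  # ng
```

A transcription-style module would consult a table like this only when the final is not followed by a vowel-initial syllable; before ㅇ the consonant is an onset and keeps its own value, as in the 있어요 example.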

Template:ko-pron already takes into account this distinction.

Wyang (talk) 06:21, 10 October 2013 (UTC)

RR itself and the McCune-R are transcription schemes, whereas Yale and "RR transliteration" in that template are transliterations. A new page-creating script I wrote (Template:ko new) generates all four from Hangeul automatically, although I forced all four to transliterate -C+V- sequences as "C-V". It's unclear what this page wants to achieve - it would be better if its purpose is specified in the module first. Wyang (talk) 06:21, 10 October 2013 (UTC)


 * So, test cases need to change -lp, -lt, -pt and other codas? Are 갋, 갌, 갍, 갎, 갏 all "gal" in the final position? And 값 is "gap"? Please check some test cases, if you can. --Anatoli (обсудить/вклад) 06:32, 10 October 2013 (UTC)


 * What does this page want to do? Is it supposed to follow the official RR? The testcases are a mix of differently transcribed/transliterated examples. Wyang (talk) 06:41, 10 October 2013 (UTC)


 * The purpose is to use in automatic transliteration of Korean terms, the way other transliteration modules are used. Whether this will replace manual transliteration is debatable but there are many terms in translations, usage examples and entries without transliteration. It doesn't need a few, just one single method everybody agrees on. It's supposed to be the official RR but as Shinji TAKASUGI suggested RR has transcriptions (also official) as described in the Korean link above and the French templates. This "mix" needs to be sorted out. (BTW, your Korean template needs a usage example.) --Anatoli (обсудить/вклад) 06:52, 10 October 2013 (UTC)


 * Do you have a complete template that follows the official RR? --Anatoli (обсудить/вклад) 07:01, 10 October 2013 (UTC)

(Incomplete) Wyang (talk) 07:19, 10 October 2013 (UTC)


 * Would be sad if the work on the module died. Could you, Wyang and Shinji, add/fix test cases? @Wyang, could you show the usage of your template, so that we could run comparisons and resolve differences when they arise? I'm currently busy with numerous Russian entry issues but I will get back to this. --Anatoli (обсудить/вклад) 23:48, 22 October 2013 (UTC)

has presented a new tool - 가방, which generates RR transcription. 가방 produces "gabang". --Anatoli (обсудить/вклад) 03:28, 11 November 2013 (UTC)

Suppressing hanja transliteration
Hanja on its own should probably be suppressed (with brackets), not sure how to handle cases like, when (自轉車) is shown in brackets. What do you think? --Anatoli (обсудить/вклад) 23:59, 14 April 2014 (UTC)
 * Thanks. Is it possible to remove brackets, if the transliteration is empty? --Anatoli (обсудить/вклад) 00:05, 15 April 2014 (UTC)


 * I have made the transliteration disappear... To make the brackets disappear you probably have to do some script-detection magic with Module:links, and I don't know how that module works. Wyang (talk) 00:06, 15 April 2014 (UTC)
 * Thanks for doing it quickly. I'm not even 100% sure if what I asked was correct, that's why I asked for your opinion first. Mixed text, e.g. will produce just "hada". I might follow it up GP. Perhaps, hanja should be linked differently in translations, using a template similar to  or . I've just made, so  is a temp workaround for  but it doesn't link to ko:wikt. --Anatoli (обсудить/вклад) 00:29, 15 April 2014 (UTC)
 * could you suggest something? Can we remove brackets, if there is no transliteration, like in case of hanja above? --Anatoli (обсудить/вклад) 00:32, 15 April 2014 (UTC)
 * All the standard templates support  to suppress the transliteration. I don't know if it's documented, though.  00:37, 15 April 2014 (UTC)


 * , thanks. works., I think the rule would be different. If translit.=source (hanja, Roman, letters, numbers and no hangeul) then use tr=-, otherwise do full transliteration, including . So,  should give "加hada" but ,  - nothing without brackets (ea is now included as a Korean term). The last test case on this module now fails (한국어(韓國語)는 주로 한반도에서 쓰이는 언어로 ...) but it shouldn't, IMO, it should include hanja in the translit because it's a mixed text. --Anatoli (обсудить/вклад) 00:51, 15 April 2014 (UTC)


 * How about now? 1) Remove '(Hanja)' - pure Hanja in brackets; 2) If text contains Hangul, then unchanged; otherwise, no transliteration. Wyang (talk) 00:56, 15 April 2014 (UTC)


 * Thanks, now it handles mixed text well but it doesn't remove brackets:
 * Expected - nil.
 * Expected - nil.
 * Expected - nil.
 * Expected - (5ea jeongdo). currently OK
 * Expected - (加hada). currently OK
 * Expected - (hada). currently OK
 * To clarify - if source == target then nil (tr=-) else translit including non-hangeul symbols. Do you agree? --Anatoli (обсудить/вклад) 01:03, 15 April 2014 (UTC)


 * You probably have to ask about how to automatically disable transliteration for input lacking '[가-힣]'... Module:ko-translit only governs the transliteration inside the brackets. Wyang (talk) 01:10, 15 April 2014 (UTC)
 * The empty brackets are probably occurring because Module:ko-translit is returning empty text. In Lua, empty text isn't actually "empty", unlike for templates (in boolean terms, the empty string "" is true, not false). The module should return  instead if it's not returning a transliteration.  01:16, 15 April 2014 (UTC)
 * Thanks. Fixed now. Wyang (talk) 01:20, 15 April 2014 (UTC)
 * Thank you, both! --Anatoli (обсудить/вклад) 01:26, 15 April 2014 (UTC)
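
The two rules agreed on above (drop pure hanja in brackets; return nothing at all when the input contains no hangul) can be sketched as follows. This is an assumption-laden Python illustration, not the module's Lua: `translit_or_none` is a hypothetical name, `romanize` stands in for whatever does the actual conversion, the hanja range used is only the basic CJK Unified Ideographs block, and Python's `None` plays the role of Lua's `nil`.

```python
import re

HANGUL = re.compile("[가-힣]")                       # any precomposed syllable
BRACKETED_HANJA = re.compile(r"\([\u4e00-\u9fff]+\)")  # e.g. (韓國語)


def translit_or_none(text, romanize):
    """Apply the two rules from the discussion before romanizing.

    Returns None (not an empty string) when there is nothing to transliterate,
    so that the caller can skip the brackets entirely.
    """
    if not HANGUL.search(text):
        return None
    return romanize(BRACKETED_HANJA.sub("", text))


def stub(s):
    """Stand-in romanizer that only marks what it was given."""
    return "<" + s + ">"


print(translit_or_none("自轉車", stub))            # None
print(translit_or_none("한국어(韓國語)는", stub))   # <한국어는>
```

Mixed text such as 加하다 contains hangul, so it goes through to the romanizer unchanged and the hanja is passed along, matching the "(加hada)" expectation listed above.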

Jamo ᅢᅥ, etc. = ae'eo
Should digraphs be split by syllables using - or ', e.g. when ᅢ and ᅥ are together, so that we get "ae'eo" or "ae-eo" instead of "aeeo"? There are a few other confusing vowel clusters, like "eeo", "oeo", "aeo", etc. Are there any official rules on these? --Anatoli (обсудить/вклад) 05:19, 15 April 2014 (UTC)


 * E.g. "taeeonal ttaebuteo jayuroumyeo" -> "tae-eonal ttaebuteo jayuro-umyeo" where 태 + 어, 로 + 우 are different syllables. --Anatoli (обсудить/вклад) 05:24, 15 April 2014 (UTC)


 * The null_final + null_initial combination is transliterated first as '-', and then kept for digraphs which could potentially cause confusion (oe, eo, ae, eu, ui) and removed for other cases. This official page on rules did not cover this in detail. Wyang (talk) 05:25, 15 April 2014 (UTC)
 * If I understand you correctly, then we should have "-" between "ae" and "eo", "o" and "u", e.g. in 태어날, 자유로우며 (with hyphens: tae-eonal, jayuro-umyeo) because they are such cases - null_final + null_initial. I think they would be more readable, even if we get more hyphens. What do you think? --Anatoli (обсудить/вклад) 05:33, 15 April 2014 (UTC)
 * I meant that only hyphens in 'e-o', 'o-e', 'a-e', 'e-u', 'u-i' are kept. I don't know if it is useful to extend the use of the hyphens. Sequences like 'aeeo', 'roum' or 'juet' cannot be really misinterpreted, I think. Wyang (talk) 05:40, 15 April 2014 (UTC)
 * OK, thanks. --Anatoli (обсудить/вклад) 05:47, 15 April 2014 (UTC)
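
The hyphen rule Wyang describes can be sketched as a post-processing pass. The assumption here is that an earlier step has already put a hyphen at every null-final + null-initial boundary and nowhere else; the function name `prune_hyphens` and the constant `KEEP` are illustrative only. A hyphen survives only when the letters on either side of it form one of the five listed pairs.

```python
import re

# Pairs that would otherwise read as a single digraph (eo, oe, ae, eu, ui).
KEEP = {"e-o", "o-e", "a-e", "e-u", "u-i"}


def prune_hyphens(text):
    """Drop syllable-boundary hyphens unless removing them creates a digraph."""
    def decide(match):
        s = match.string
        pair = s[match.start() - 1] + "-" + s[match.end()]
        return "-" if pair in KEEP else ""

    # Lookarounds keep the neighbouring letters out of the match, so
    # consecutive boundaries such as "a-e-o" are each judged on their own.
    return re.sub(r"(?<=[a-z])-(?=[a-z])", decide, text)


print(prune_hyphens("tae-eonal"))     # taeeonal
print(prune_hyphens("jayuro-umyeo"))  # jayuroumyeo
print(prune_hyphens("ga-eul"))        # ga-eul
```

Extending the hyphen to every vowel-vowel boundary, as proposed earlier in this section, would amount to returning the input unchanged.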