Module talk:zh/data/yue-word

Cantodict data, can be used to add Cantonese pronunciations (~60000 words)

Wyang (talk) 03:11, 7 August 2014 (UTC)

Cantodata
How did you manage to extract it? Is Adam Sheik happy with it, or did he give it to you himself? --Anatoli T. (обсудить/вклад) 23:58, 7 August 2014 (UTC)
 * CEDict data could also be used for Mandarin. There may be some occasional terms that would not meet CFI. --Anatoli T. (обсудить/вклад) 00:01, 8 August 2014 (UTC)
 * I used the script at User:Wyang/code to extract it - it is still running and there are about 20,000 left (another 30 min). No, this doesn't have Adam Sheik's permission, which is why I only gave a snippet view of it. The pronunciation data (Jyutping) is uncopyrightable, and is potentially useful here. We can 1) use a bot to automatically add Cantonese pronunciations to existing Chinese entries; 2) look up this data whenever a new Chinese entry is created. Wyang (talk) 00:21, 8 August 2014 (UTC)
 * Awesome. Very good idea. Please consider extracting CEDict as well; the whole dictionary could be used to create Chinese entries here. WWWJDIC or EDICT data could similarly be used for Japanese. --Anatoli T. (обсудить/вклад) 00:31, 8 August 2014 (UTC)
 * They are now uploaded: Special:PrefixIndex/Module:zh/data/Jyutping_word. Wyang (talk) 03:42, 8 August 2014 (UTC)
 * Thank you but for me it's easier to access the site directly. :) Not sure I'll personally be able to use these files. --Anatoli T. (обсудить/вклад) 12:42, 8 August 2014 (UTC)
 * I haven't incorporated these in the current infrastructure yet... Let me do the integration soon... Wyang (talk) 10:25, 9 August 2014 (UTC)
 * Both utilities are done:, . Wyang (talk) 01:16, 11 August 2014 (UTC)
 * Great stuff! However, I am a bit concerned about editors not familiar with Cantonese automatically generating both Mandarin and Cantonese, relying only on automatic Pinyin and Jyutping generation. Perhaps the automatic Jyutping should be a parameter, e.g. c=y? (Mandarin could be disabled by m=n for Cantonese-only entries?) Also, not sure what you used to update Jyutping in existing entries, such as you did . --Anatoli T. (обсудить/вклад) 01:34, 11 August 2014 (UTC)
 * I think the benefit of allowing automatic generation will be greater - it's quite hard to get it wrong. Both can be disabled by |m/c=-. I used the yue_for_bot function in Module:zh to do the latter. Wyang (talk) 01:41, 11 August 2014 (UTC)
 * I think I understand it now. It actually checks Jyutping for the whole word and adds it only when it can find one? It's great then! --Anatoli T. (обсудить/вклад) 06:57, 11 August 2014 (UTC)
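
The whole-word lookup discussed in this thread can be sketched roughly as follows. This is only a minimal illustration in Python, not the actual Module:zh code: the `JYUTPING_WORDS` table, its two sample entries, and the `lookup_jyutping` helper are hypothetical names, and treating `-` as "disabled" simply mirrors the |c=- convention mentioned above.

```python
from typing import Optional

# Hypothetical sample of a word -> Jyutping table. The real data is stored
# in the module's data pages and is far larger.
JYUTPING_WORDS = {
    "香港": "hoeng1 gong2",
    "你好": "nei5 hou2",
}


def lookup_jyutping(word: str, c: Optional[str] = None) -> Optional[str]:
    """Return the Jyutping to add for `word`, or None to add nothing.

    `c` imitates an editor-supplied parameter (an assumption of this sketch):
    "-" disables automatic generation, any other non-empty value overrides it.
    """
    if c == "-":
        return None  # generation explicitly disabled
    if c:
        return c  # manual value takes precedence over the table
    # Only the whole word is looked up; there is no character-by-character
    # fallback, so unknown words simply get no pronunciation.
    return JYUTPING_WORDS.get(word)
```

Because nothing is returned when the whole word is missing from the table, an editor unfamiliar with Cantonese cannot end up with a reading stitched together from individual characters.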

New additions
(I know Justin is on ...)

I incorporated the CC-Canto and CC-CEDICT Cantonese data into our existing Cantonese data today, so that now it has increased to >134,000 entries with pronunciation. Examples of the addition are this and this.

I'd love to know what you guys think. There are some errors in their data, especially in words with 多音字 (characters with multiple readings), and I'm not sure whether the new changes are worthwhile. There are two options: either we keep the new data and check all the incoming entries in Category:Cantonese lemmas to make sure the Cantonese pronunciation is correct, or we revert to the old revisions because the new data is too unreliable.

Wyang (talk) 13:50, 22 October 2017 (UTC)


 * I'm not sure either. Maybe we can wait and see. —suzukaze (t・c) 03:40, 23 October 2017 (UTC)


 * Most are good but not 100%. It seems the approach in both CC-Canto and CC-CEDICT is that whatever works in Mandarin will work in Cantonese, even if a term is colloquial or even regional. The quality of the Jyutping transliteration sometimes depends on the contributor, who could have used the most frequent reading or a random one, and errors may not be noticed by others (native speakers or advanced learners). On the upside, Cantonese entries get checked at Wiktionary and it may not take a very long time for an error to be noticed. I don't have a strong objection but let's see what others think. An option is to add for each automated entry. --Anatoli T. (обсудить/вклад) 12:21, 23 October 2017 (UTC)


 * If we're going to tag CC-Canto data with attention we might as well tag all uses of zh-new: sometimes I don't notice problems with pinyin, sometimes the Hokkien data has peculiarities, etc. I don't think that approach is practical. The CantoDict data isn't perfect either; corrections from  to   (the /j/ initial) have been made a few times. —suzukaze (t・c) 21:06, 23 October 2017 (UTC)


 * (Kinda late, but I'm back from the break!) I think the data should be fine. There are just some problems such as not indicating tone change and using full-width commas. Overall, it's probably as problematic as Cantodict is. — justin(r)leung { (t...) 04:20, 28 October 2017 (UTC)


 * Ok, thanks all. We will just keep an eye on it for now, and I will run some regular checks to try to catch the errors (like User:Wyang/yue-char-pron). (Btw, glad to see you are, Justin!) Wyang (talk) 05:07, 28 October 2017 (UTC)
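
As an illustration of what a simple regular check on the merged data might look like, here is a rough Python sketch. It is not the check at User:Wyang/yue-char-pron; the `find_problems` helper is hypothetical, and it assumes plain Jyutping syllables (lowercase letters followed by a tone digit 1 to 6), so any notation used for tone changes would also be flagged and need separate handling.

```python
import re
from typing import List

# Plain Jyutping syllable: lowercase letters followed by one tone digit 1-6.
SYLLABLE = re.compile(r"^[a-z]+[1-6]$")
FULLWIDTH_COMMA = "\uff0c"  # the comma used in Chinese text


def find_problems(reading: str) -> List[str]:
    """List suspicious features of one Jyutping reading string."""
    problems = []
    if FULLWIDTH_COMMA in reading:
        problems.append("full-width comma")
    # Split on whitespace and on both kinds of comma, then test each syllable.
    for syllable in re.split(r"[\s,\uff0c]+", reading.strip()):
        if syllable and not SYLLABLE.match(syllable):
            problems.append(f"bad syllable: {syllable}")
    return problems
```

A check like this only catches formatting slips; wrong readings of characters with multiple pronunciations would still need review by someone who knows Cantonese.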

Could Kaifang Cidian data be added as well? —suzukaze (t・c) 04:30, 11 November 2017 (UTC)


 * Yes, I will add it. Wyang (talk) 04:55, 11 November 2017 (UTC)