Module talk:languages


 * Archives
 * 2013–2018

redundant grc special casing
In, line 258. It looks like this can be removed safely, since  always transforms back to canonical form. ? – Jberkel 22:40, 22 March 2019 (UTC)
 * Quite right. — Eru·tuon 22:48, 22 March 2019 (UTC)
 * And what's up with the "ar" part just above? Can't that be encoded in the data modules? – Jberkel 23:31, 22 March 2019 (UTC)
 * I couldn't think of a good way to do it at the time. Maybe you would have an idea. — Eru·tuon 23:34, 22 March 2019 (UTC)

makeEntryName
The regex at the beginning of  should probably be adjusted so it doesn't prevent linking to ¡ ! and ¿ ?, e.g.  ends up empty. – Jberkel 21:23, 26 April 2019 (UTC)
 * I suppose an overriding rule could be that if you do the replacements and end up with nothing but whitespace, then don't do any replacements. —Rua (mew) 21:50, 26 April 2019 (UTC)
 * Added to the testcases. As to the pattern, one idea I have, which seems to work in the sandbox, is to leave the punctuation marks if the string doesn't contain one non-whitespace non-punctuation character. — Eru·tuon 21:58, 26 April 2019 (UTC)
 * Your suggestions sound straightforward. Before making any changes I'd like to add a few more tests, the regex is long and only some cases are covered. – Jberkel 21:34, 29 April 2019 (UTC)
 * This works now (thx Eru·tuon). – Jberkel 22:36, 16 July 2019 (UTC)
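
The rule suggested above (undo the stripping when nothing but whitespace would remain) can be sketched as follows. This is only an illustration in Python, not the module's Lua code: the function name `make_entry_name` and the choice to strip every leading and trailing character in a Unicode punctuation category are assumptions, and the real pattern is more selective.

```python
import unicodedata


def _is_punct(ch: str) -> bool:
    # Unicode general categories starting with "P" are punctuation.
    return unicodedata.category(ch).startswith("P")


def make_entry_name(text: str) -> str:
    """Strip leading/trailing punctuation unless nothing would be left."""
    start, end = 0, len(text)
    while start < end and _is_punct(text[start]):
        start += 1
    while end > start and _is_punct(text[end - 1]):
        end -= 1
    stripped = text[start:end]
    # Overriding rule: if only whitespace remains, keep the original text.
    if stripped.strip() == "":
        return text
    return stripped
```

With this rule, a title made up only of punctuation and spaces keeps its punctuation, while ordinary words still lose surrounding marks.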

Edit request for Nahuan languages
The ancestor for Nahuan languages should be changed from  to. This applies to azd, azn, azz, naz, nch, nci, nhe, nhg, nhi, nhm, nhn, nhq, nhw, nhx, nhy, nhz, nlv, npl, nsu, nuz, and ppl. --Lvovmauro (talk) 08:42, 26 August 2019 (UTC)
 * Actually, it would solve the problem to just remove the  field because all these languages have   as their family, and the common ancestor of   is  . — Eru·tuon 17:19, 26 August 2019 (UTC)
 * I've done that. — Eru·tuon 17:28, 26 August 2019 (UTC)
 * I did not do this for all the Nahuan languages, only the ones you listed (so there are still descendants of ); is that correct, or should all be direct descendants of  ? — Eru·tuon 17:30, 26 August 2019 (UTC)


 * Oops. It's supposed to apply to all of them. --Lvovmauro (talk) 03:25, 27 August 2019 (UTC)
 * Okay, I've done it to the rest of them, so there are no descendants of  shown in the language tree anymore. — Eru·tuon 04:09, 27 August 2019 (UTC)
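
The fix described in this thread relies on a fallback: a language with no explicit ancestor inherits from the proto-language of its family. A minimal Python sketch of that lookup, assuming hypothetical codes (`xx-a`, `xx-b`, `fam-x`, `fam-x-pro`) and hypothetical field names rather than the module's actual data:

```python
# Hypothetical data, standing in for the language and family data modules.
LANGUAGES = {
    "xx-a": {"family": "fam-x", "ancestors": None},        # no explicit ancestor
    "xx-b": {"family": "fam-x", "ancestors": ["xx-old"]},  # explicit ancestor
}
FAMILIES = {
    "fam-x": {"proto_language": "fam-x-pro"},
}


def get_ancestors(code: str) -> list:
    """Return explicit ancestors, else fall back to the family's proto-language."""
    lang = LANGUAGES[code]
    if lang.get("ancestors"):
        return list(lang["ancestors"])
    proto = FAMILIES.get(lang["family"], {}).get("proto_language")
    return [proto] if proto else []
```

Under this scheme, removing the explicit field from every language in a family is enough to make them all direct descendants of the family's proto-language.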

Languages missing
While working with the Improved-xte.js I found the following languages that are missing from our language module: "Jawi" : "null", "Rumi" : "null", "Syriac" : "null", "Skolt" : "null", "Southern Skolt" : "null", "Southern" : "null", "Devanagari" : "null", "Pite" : "null", "Mecayapan" : "null",
 * Mymensinghiya: পসৰ/পসর (pośor) (light/translations dialect of Bangla see https://duckduckgo.com/?q=%22Mymensinghiya%22+639-1&ia=web)
 * "Jawi Malay" : "null",
 * "Rumi Malay" : "null",
 * "Syriac Hebrew" : "null",
 * "Skolt Northern" : "null",
 * "Devanagari Pali" : "null",
 * "Pite Northern" : "null",
 * "Mecayapan Classical" : "null",

False positives could exist in this list because it was generated by a script.--So9q (talk) 11:52, 1 October 2019 (UTC)
 * We have Category:Pite Sami language and Category:Skolt Sami language, which some of those are presumably dialects of. —Rua (mew) 12:49, 1 October 2019 (UTC)


 * "Jawi Malay" is just Malay written in the Arabic script (see Category:Malay terms with Jawi spelling), and "Rumi Malay" is Malay written in the Roman alphabet. Likewise "Syriac Hebrew" and "Devanagari Pali" are just languages written in particular scripts.
 * Category:Mecayapan Nahuatl language and Category:Classical Nahuatl language are two different languages; there is no "Mecayapan Classical". --Lvovmauro (talk) 01:18, 2 October 2019 (UTC)
 * Thanks for taking the time to comment on this. I have now found that only the first one in my list is actually missing. Could we add it and perhaps invent our own language code for it?--So9q (talk) 06:03, 2 October 2019 (UTC)

South Jutish
Hi, I would like to be able to distinguish between Jutish and South Jutish (see https://en.wikipedia.org/wiki/South_Jutlandic) as they are two different dialects of Danish. The former is in the module with "jut" and the latter does not seem to have an ISO code. Could we add it with the code jut-sj (sj as initials for South Jutish)? It should also be added to the translation-adder-data and nested under Danish.--So9q (talk) 10:33, 8 October 2019 (UTC)

Some Kipchak languages have Kipchak set as ancestor, some not.
While Bashkir and Tatar and Kazakh, Karakalpak, Siberian Tatar and Nogai and Kyrgyz have Kipchak (qwm) set as ancestor, Karachay-Balkar (krc), Kumyk (kum), Karaim (kdr), Krymchak (jct), Crimean Tatar (crh), Urum (uum) have not, although especially these are understood under the “Cuman” language which is rightly listed as a synonym in the data for the Kipchak language. They are currently unassigned. They belong to this Kipchak language. Note also that there is a redundant “Mamluk-Kipchak language” (trk-mmk) that is the very same Kipchak language (if one creates Kipchak entries, one creates them from Mamluk sources as well as more northerly sources), so it must be removed as a synonym. Another synonym is Middle Kipchak: This Kipchak language is the mutually intelligible lingua franca of the and had some literary use in Egypt as some Kipchaks happened to have grabbed power there particularly in the, hence also there is the name of “Mamluk-Kipchak” for a certain type of source texts (particularly teaching material written for Arabic readers), however nobody has ever thought about a distinct language there. All those modern derive from Kipchak like modern  derive from  (while Proto-Mongolic is reconstructed for just the time before Chengis Khan), it was a koiné under which for this period one cannot discern distinct Kipchak languages: the era given under Kipchak language “11th–17th century” is right, and contemporary Arabs also distinguished only Kipchak from Turkmen. It’s also like we have German, Alemannic German, Yiddish etc. descending from Middle High German. Just saying it explicitly as some admin already has to ask himself why qan has module errors. The inheritance statements there are correct, but the module data is incomplete.
Note that we already list all these languages in descendant lists under Kipchak; we also list Southern Altai together with Kyrgyz there, however the taxonomy of Southern and Northern Altai is controversial, some treat it as Kipchak but we can keep it as Siberian, it’s probably transitionary. Note another synonym of Southern Altai: Oirot language (ойро́тский). Fay Freak (talk) 16:06, 19 October 2019 (UTC)

Missing Chippewa language (ciw)
There doesn't seem to be any trace of it on English Wiktionary, but French Wiktionary and English Wikipedia recognize it. Why? Kevlar67 (talk) 19:32, 11 December 2019 (UTC)
 * It looks like it's subsumed under Ojibwe. — Eru·tuon 19:41, 11 December 2019 (UTC)
 * I found some related discussions in a search for :, . It is also mentioned in Language treatment. Pinging User:-sche, who participated in the first discussion and mentioned that it might make sense to merge other varieties of Ojibwe into  . — Eru·tuon 20:01, 11 December 2019 (UTC)
 * It's too bad Stephen G Brown has not been around lately, as he was one of the few users with any direct familiarity with the language. Yes, ciw was subsumed into oj on the grounds of being too similar to be sensible to treat as a separate language (I don't recall exactly when — possibly before my time here), and it probably would be sensible to merge some of the other codes, but I'd prefer to get input from speakers/people who would actually add any significant amount of Ojibwe content, about how they'd like to add it. (After all, our Norwegian editors prefer to keep that language's British-vs-American-English analogues under entirely separate headers and codes...whereas, Chinese offers a model of successful unification...) - -sche (discuss) 22:49, 11 December 2019 (UTC)
 * I don't know anything about these languages either, but I do think it's odd that all of the other languages subsumed under  get to be listed as distinct languages, but   doesn't. —Mahāgaja · talk 11:45, 12 December 2019 (UTC)

package.loaded
Just a note about this edit that you reverted: setting  will cause weird effects if another module tries to load the sortkey module after this point in the module invocation. It sets what  returns (see this sandbox demonstration). It would work to store sortkeys in a different table (which I wrote out as an edit before discovering you'd reverted).

It looks like memoizing sortkeys would reduce the number of times that a sortkey-making function is called in at least (because it emits categories after each label parameter that has categories associated with it), but not with most templates that just emit one set of categories at the end of the template output, because the memoization only survives for the time of one module invocation. I'm not sure whether memoization is worth it, because memoization adds a table but removes some executions of the sortkey function.

Another note is that memoization as initially implemented assumed that the sortkey function is called on a single title in a given module invocation (and that the script detection function yields one script code for a given title and sortkey module). This is probably an accurate assumption on Wiktionary at the moment, but it will break code outside of Wiktionary that evaluates a single instance of the module and uses it across many pages. This is fine for Wiktionary but I will have to keep this in mind if I use sortkey generation code outside Wiktionary. — Eru·tuon 02:41, 29 November 2022 (UTC)


 * @Erutuon Thanks! Yes - it seems to be causing something strange with Module:zh-sortkey that I can't seem to figure out - though I'm not sure that it's right that memoisation only survives until the end of the invocation, as this seems to be how mw.loadData ensures that data survives between multiple invocations (though I may be misunderstanding it, as the code isn't very clear to me). The fact that the Chinese module was throwing errors only on the second invocation suggests I'm onto something, though. Theknightwho (talk) 02:50, 29 November 2022 (UTC)
 * stores the tables returned by data modules so that they are only executed once per page parsing (but you can't directly access or modify the stored table), and it prevents the data module from seeing anything from a template invocation while it is being executed.  executes modules once per module invocation (or more times if you set   before calling   another time in a module invocation). Neither can pass any data between module invocations, and that includes tables containing memoized sortkeys. — Eru·tuon 03:35, 29 November 2022 (UTC)
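
A minimal sketch of the alternative mentioned at the top of this thread, storing memoized sortkeys in a table of their own rather than in the module loader's registry. It is written in Python as an analogue of the Lua code; the names `_sortkey_cache` and `memoized_sortkey` are hypothetical. The cache is keyed by both title and script code, so it does not assume a single title per invocation.

```python
from typing import Callable, Dict, Tuple

# A dedicated cache, separate from whatever registry the module loader uses.
_sortkey_cache: Dict[Tuple[str, str], str] = {}


def memoized_sortkey(
    title: str,
    script_code: str,
    compute: Callable[[str, str], str],
) -> str:
    """Compute a sortkey at most once per (title, script code) pair."""
    key = (title, script_code)
    if key not in _sortkey_cache:
        _sortkey_cache[key] = compute(title, script_code)
    return _sortkey_cache[key]
```

In the Scribunto setting described above, such a table would still be discarded at the end of each module invocation, so it only helps templates that request the same sortkey several times within one invocation.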

remove_exceptions
I don't really understand what this is supposed to do, but the loop doesn't have an effect because  is a local variable. To change the table, the loop would need to assign to the table entry itself. It would be good to add some testcases to verify that this code is doing what it's supposed to. — Eru·tuon 21:43, 6 January 2023 (UTC)
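
The pitfall can be illustrated with a Python analogue (the Lua loop under discussion is not reproduced here, and `upper()` is only a stand-in for the real transformation). Rebinding the loop variable changes a local name; the container is only updated when the result is written back by index or key.

```python
def broken(values: list) -> list:
    for v in values:
        v = v.upper()  # rebinds the local name only; the list is untouched
    return values


def fixed(values: list) -> list:
    for i, v in enumerate(values):
        values[i] = v.upper()  # writes the new value back into the list
    return values
```

A testcase comparing the table before and after the loop would catch the first form immediately.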


 * @Erutuon What this does is take characters given in the  subfield, which can be part of   or , and prevent them from having their diacritics stripped when they otherwise would be. For example, Russian . This is to allow the process of creating entry names and sort keys to be done with the NFD version of the string, which greatly simplifies the process in the majority of cases - but obviously creates the potential pitfall of stripping diacritics from characters that were previously treated as atomic, where we still want to keep them.
 * In implementing this, I intended to convert those characters into PUA characters before doing the usual substitution, so that they could be correctly converted back afterwards. However, due to this mistake, it's actually the concatenated codepoints that are being substituted in instead. In 99.9% of cases this will cause no issues, but of course there's going to be a problem if any accidental occurrences of the codepoint occur. As such, I've fixed this. Theknightwho (talk) 23:25, 6 January 2023 (UTC)
 * Okay, it looks correct now and I added a testcase to verify that it works right for Russian. I wish I knew an alternative way because the code creates an intermediate string object for every unique not-to-be-changed sequence of code points in the input string, but I can only think of more complicated ideas that I'm not sure how to implement. Lua doesn't make it easy to reduce memory allocations for strings.
 * I'd appreciate it if you'd add testcases that demonstrate that the features you're adding are working. I will add a check to Module:data consistency check that the substitutions are not going to collide with real characters or go over 0x10FFFF. At the moment we're safe because none of the  code points in the language data is greater than U+1FFFF (the difference between the maximum code point, 0x10FFFF, and the code point currently being added, 0xF0000). Any other assumptions we're making in the code should be checked there as well. — Eru·tuon 17:50, 7 January 2023 (UTC)
 * @Erutuon Great - thanks! I will do. I've also just refactored the code so as to consolidate an extra step under (what is now called) :
 * The code is now a bit easier to follow, and a tiny bit more efficient (as it should hopefully bypass any redundant normalizations).
 * It makes adding any further substitution fields straightforward. For example, the undocumented  field used by   is now fully-featured, and could (in theory) be used as a string to point at a dedicated module. Not sure if any language uses this at present, but given it was already there I felt it should be incorporated.
 * It ensures that inputs are always processed as (fixed) NFD, but output as (standard) NFC, which should prevent any future errors caused by comparing NFC and NFD strings.
 * I haven't subsumed the transliteration process under this, due to the fact that transliteration is always done via dedicated modules which are expecting (fixed) NFC inputs, but in theory it could be folded in as well. This might be an option to consider for scripts that are straightforward to transliterate, where it can be done entirely with a series of gsubs.
 * Theknightwho (talk) 18:59, 7 January 2023 (UTC)
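
The approach discussed in this thread (swap protected characters for private-use placeholders, strip diacritics on the NFD form, swap the placeholders back, output NFC) can be sketched as below. This is a simplified Python illustration under stated assumptions: exceptions are single precomposed characters, "stripping diacritics" means dropping every nonspacing mark, and the function name is hypothetical. The offset 0xF0000 and the 0x10FFFF limit are the values mentioned above.

```python
import unicodedata

PUA_OFFSET = 0xF0000        # offset added to protected code points
MAX_CODE_POINT = 0x10FFFF   # highest valid Unicode code point


def strip_diacritics_keeping(text: str, exceptions: set) -> str:
    """Drop combining marks, except inside the characters in `exceptions`."""
    for ch in exceptions:
        # Consistency check: the placeholder must be a valid code point.
        if len(ch) != 1 or ord(ch) + PUA_OFFSET > MAX_CODE_POINT:
            raise ValueError(f"cannot protect {ch!r}")

    # 1. Compose, then hide protected characters behind placeholders.
    composed = unicodedata.normalize("NFC", text)
    protected = "".join(
        chr(ord(ch) + PUA_OFFSET) if ch in exceptions else ch
        for ch in composed
    )
    # 2. Decompose and remove nonspacing marks (category Mn).
    decomposed = unicodedata.normalize("NFD", protected)
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # 3. Restore the protected characters and return NFC.
    restored = "".join(
        chr(ord(ch) - PUA_OFFSET)
        if ord(ch) >= PUA_OFFSET and chr(ord(ch) - PUA_OFFSET) in exceptions
        else ch
        for ch in stripped
    )
    return unicodedata.normalize("NFC", restored)
```

For a Russian word carrying a stress mark, protecting й removes the acute accent but keeps the breve; without the exception, й would be reduced to и. The sketch does not guard against input that already contains a placeholder code point, which is the collision case the consistency check mentioned above is meant to rule out.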

Script-specific replacements
@Erutuon @Benwing2 Just flagging that I've added a way to specify,   and   on a script-specific basis: this can be specified by using the script code as the key, and using a table/string of the usual format as the value. The use-case which led to this was Dungan, which uses from/to arrays in Cyrillic, but runs Module:zh-sortkey for Han script terms. However, it also has the added bonus of cutting down the number of pointless gsubs in languages like Serbo-Croatian. 0 pages have thrown memory errors, and I'm noticing small reductions in memory usage across a wide range of pages.
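
As a rough illustration of the script-keyed format described above, here is a Python sketch. Everything in it is an assumption made for the example: the data for Dungan, the `from`/`to` field names as written, the stand-in `zh-sortkey` callable, and the use of plain string replacement instead of Lua patterns.

```python
def select_rule(field, script_code: str):
    """Pick the rule for a script if the field is keyed by script codes."""
    if isinstance(field, dict) and "from" not in field:
        return field.get(script_code)  # script-keyed table
    return field                       # usual format: applies to every script


def apply_rule(text: str, rule, modules: dict) -> str:
    if rule is None:
        return text                    # nothing specified for this script
    if isinstance(rule, str):
        return modules[rule](text)     # a string names a dedicated module
    for old, new in zip(rule["from"], rule["to"]):
        text = text.replace(old, new)  # from/to arrays, applied in order
    return text


# Hypothetical data: arrays for Cyrillic, a module for Han script.
DUNGAN_SORT_KEY = {
    "Cyrl": {"from": ["\u0451"], "to": ["\u0435"]},
    "Hani": "zh-sortkey",
}
MODULES = {"zh-sortkey": lambda text: "han:" + text}
```

Because the rule is chosen by script before any substitution runs, text in one script never pays for replacements that only apply to another.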

I was thinking that it would be good to retire Module:translit-redirect by bringing  in line with the other three (as they have identical syntax), but this might involve quite a bit of preparation work as it seems to be quite heavily integrated into everything. What do you think? Theknightwho (talk) 05:46, 28 January 2023 (UTC)
 * Hmm. I am not super familiar with Module:translit-redirect, which was created by User:Erutuon, but am I correct in saying its sole purpose is to handle per-script transliteration? If so then yes it makes sense to me to allow  to be directly conditionalized on script. Also when you say "it seems to be quite heavily integrated into everything" does "it" refer to Module:translit-redirect or   (and how many modules do you estimate need to be touched)? Benwing2 (talk) 05:55, 28 January 2023 (UTC)
 * @Benwing2 I meant Module:translit-redirect, and yes, that is its sole purpose (as far as I can tell). I haven't looked into it in any detail, but it does seem to be used for a large number of languages, and it's also connected into a few other modules (e.g. the languagecatboiler).
 * This also adds the possibility of ditching some of the simpler translit modules, where they're just performing a series of substitutions. Theknightwho (talk) 06:12, 28 January 2023 (UTC)
 * This all sounds good to me. However, for this kind of thing I would recommend: (1) create a plan beforehand of how you'll go about implementing such a change, and circulate it around a bit; (2) do your edits in sandbox modules, test using sandbox modules and push the modules together to production at the end; (3) make sure you include detailed changelog messages with every change to a production module. I've noticed you have a tendency to make lots of little changes directly to the production modules, all with empty changelog messages; this not only puts a lot of load on the server but it makes it very hard to audit your code after the fact because there's no indication from the changelog messages of what was changed. See Module:it-verb, Module:pt-pronunc, Module:form of and pretty much any other module I've worked on in the last few years for examples of detailed changelog messages along with individual changes that are self-contained and pre-tested. What I'm recommending is overall good software engineering practices that IMO will help you in your professional career (assuming you do software of some sort for a living). Benwing2 (talk) 06:31, 28 January 2023 (UTC)
 * @Benwing2 Thanks! I will make sure to do that. I actually work in law - legal drafting and coding are surprisingly similar, to be honest. Theknightwho (talk) 06:35, 28 January 2023 (UTC)
 * Makes a lot of sense to me ... both are highly detail-oriented and logical, and generally intolerant to sloppy thinking. Benwing2 (talk) 06:36, 28 January 2023 (UTC)
 * As examples of complexity, I suggest the Thai script transliteration modules (particularly Module:th-translit and Module:pi-translit), the Sanskrit language modules and the Kharoshthi transliteration module Khar-translit.
 * These will show the perhaps obvious point that transliteration depends on both script and language, and that knocking up a general purpose Indic transliterator isn't as easy as one might at first think. You will also see that there can be different writing systems for the same script and language, as horrendously manifest in mainland SE Asia.  Module:translit-redirect has the advantage for me that I can edit it; you'll note that Module:pi-translit supports several languages and scripts. RichardW57m (talk) 16:32, 1 February 2023 (UTC)
 * @RichardW57m Module:translit-redirect is likely going to be retired in the medium-term, so that the correct transliteration module can be selected directly from the data module (rather than going through a middleman, which doesn't really do very much other than waste resources by existing). I've already streamlined the substitution process to work on a script-specific basis, which has halved the number of substitutions in Serbo-Croatian, for example, and also means we can use a mix of substitutions and modules depending on the script (cf. Dungan). This is particularly advantageous for Pali and Sanskrit, of course, where the potential gains are much greater due to the number of scripts. On top of that, I have also introduced Module:languages/shareddata, which makes it possible to retire some of the simpler modules which exist only to synchronise substitutions between groups of languages. Thai and Lao would certainly benefit from that, and I suspect several other Indic scripts, too.
 * I also agree that many of our transliteration modules are problematic: the fact they're segregated off has allowed some of them to turn into monstrosities for no practical reason, and sometimes they don't get noticed. See, for example, the state of ky-translit at one point, which thankfully I was able to flag up quite quickly. One solution to this would be to integrate many of the simpler transliteration procedures into the substitution process, which not only reduces waste by reducing the number of modules loaded on a page, but also makes it much more likely that wasteful crap can be spotted more quickly. Obviously this wouldn't be possible for all languages, but it certainly would be for a great number.
 * In the meantime, it might be worth you requesting template editor permissions, given this change would be an issue for you. It also sounds like it would be helpful for you generally, too. Theknightwho (talk) 16:59, 1 February 2023 (UTC)
 * Correction: It is Module:translit-redirect/data that I edit, but I would need access to Module:languages/shareddata to replace its loss. Absorbing the code module into Module:scripts or Module:languages would not be a loss, beyond trashing the page cache. --RichardW57m (talk) 18:13, 1 February 2023 (UTC)
 * There may be some benefits from genericising the Thai language transliteration complex, but one needs something as horrendous as that because automatic Thai syllabification is difficult, and perfection is impossible (see ). (Programmers who move to Thailand often get sucked into the swamp of automatic transliteration.) The problem is that genericisation might increase the module count, which at present would be an instant loss, albeit probably a long term maintainability gain - except for the code hiding of object-orientation.  Tai languages called Thai dialects may need that, though Northern Thai in Lanna script looks more tractable, partly for the same reason that Lao is.  By contrast, Pali in Thai script is a well-behaved Indic language for transliteration; the only problem is that one has to detect the writing system (alphabet v. abugida), which can usually be done when it makes a difference. --RichardW57m (talk) 18:13, 1 February 2023 (UTC)
 * The avowed primary purpose of Module:translit-redirect was to save run-time memory in Lua. There seems to be a memory cost for each module used.  Now, if one invokes a language's transliteration module, and then that invokes another module to do the actual transliteration, that is 2 modules used.  I suspect the cost was greatest for pages with translation tables, where multiple languages need to be transliterated. --RichardW57m (talk) 16:57, 1 February 2023 (UTC)

Error
Whenever I use any template that relies upon this module, I get this error message, "Lua error in Module:languages at line 791: The function getByCode expects a string as its first argument, but received a table". Please fix it. Sbb1413 (he) (talk • contribs) 07:21, 27 March 2023 (UTC)

Unwanted uppercase sort key for Georgian
Georgian scripts are unicameral, i.e. they don't use the upper case (see Georgian Extended, "Unlike all other casing scripts in Unicode, there is no title casing between Mkhedruli and Mtavruli letters"), but this module returns a capitalized sort key. I.e., after I removed an explicitly set category keeping the one added by the head template, the sortkey became ᲛᲘ- instead of მი- (see Category:Georgian preverbs with sort keys in uppercase; I created a test:  gives ""). So, the category engine deals correctly with categories for Georgian, while the module does not. Please fix. Gradilion (talk) 15:17, 13 April 2023 (UTC)
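
One possible shape for a fix, sketched in Python: skip the uppercasing step for scripts that should not be case-folded. That the module uppercases sort keys by default is an inference from the report above, and the names `UNICAMERAL_SCRIPTS` and `make_sort_key` are hypothetical. The uppercase output arises because Unicode defines uppercase mappings from Mkhedruli to Mtavruli letters.

```python
# Script codes whose text should be left alone when building sort keys.
# "Geor" is the ISO 15924 code for Georgian (Mkhedruli).
UNICAMERAL_SCRIPTS = {"Geor"}


def make_sort_key(text: str, script_code: str) -> str:
    """Uppercase the sort key unless the script is treated as unicameral."""
    if script_code in UNICAMERAL_SCRIPTS:
        return text
    return text.upper()
```

With such an exemption the preverb მი- would keep its Mkhedruli form in the category listing, matching what the category engine produces on its own.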

Piped links fail
1) While the wikitext  works as nouns,   incorrectly produces

This used to work with the pipe. Of course, without the pipe, it works fine. On my slightly older fork of Wiktionary, it is currently considered an unsupported title. Dpleibovitz (talk) 13:49, 24 October 2023 (UTC)


 * @Dpleibovitz Thanks - it should be an unsupported title link. What date is your fork from?
 * By the way, I’m currently working on a massive revamp of link parsing, which should eliminate most of these issues: Module:User:Theknightwho/wikitext parser. It may be easier to wait until that’s ready, since this is quite a niche issue and may be difficult to bugfix with the current code. Theknightwho (talk) 14:10, 24 October 2023 (UTC)
 * @Theknightwho Complicated answer. I just recently upgraded from MediaWiki 1.31 to 1.39. Content is a mixture ranging from 2013 onwards of multiple Wikimedia wikis (sisters to you, related knowledge projects to me). Typically, Wiktionary (template/module) code causes me the most angst. It's only a couple of weeks old. But the rate of change is massive, it is brittle (understandably so) and I must update it often as bugs crop up. Much of Wikipedia code is less brittle. Wish you had versioning control, but I'm kinda adding that in my fork (as a prototype). Note that I'm manually copying content on a page by page basis (via copy & paste) as I need them for my research. It is not a traditional fork (of the entire content). When a new page needs new functionality, I must import the necessary template/module code as needed. Often these require existing template/module code to be upgraded. Argh!


 * Please do inform me when the revamp is done. Thanks Dpleibovitz (talk) 14:48, 24 October 2023 (UTC)
 * @Dpleibovitz Thanks for the detail - that makes sense. The aim of the wikitext parser is to get rid of many of the issues you mention: Wiktionary code is a lot more interconnected than Wikipedia's (to my understanding), and many of the problems are caused by a large number of people with varying levels of skill all trying to tackle the same recurring problems with a bunch of different, incompatible strategies. The new approach began as a heavily-modified version of mwparserfromhell, but is now pretty much all my own code: a node-tree is built using a tokeniser, with the various nodes representing things like wikilinks, html tags etc, which can be nested within each other as appropriate. These nodes have various methods, and can recursively iterate over themselves and any child nodes, so that it's trivial to (e.g.) do replacements in all visible text while preserving formatting, link targets, and so on.
 * There are various advantages to this: (a) it's extremely fast, (b) it compartmentalises everything, making upgrades/new features/bugfixing much simpler, (c) it allows the pre-existing modules to be simplified, because we can remove all the shitty workarounds they have at the moment, and (d) it makes things easier for the user, since they won't need to think about wiki formatting, html etc. most of the time. Theknightwho (talk) 15:16, 24 October 2023 (UTC)

2) On a related note (same set of software changes), if a piped name is not specified, it should not be placed in the generated link. In my fork of Wiktionary, this allowed the DISPLAYTITLE extension to update the displayed name. For whatever reason, many of my actual pages have a prefix of "Wikt/" which are hidden by DISPLAYTITLE. Again, this used to work but broke with the refactoring of unsupported titles. Dpleibovitz (talk) 13:56, 24 October 2023 (UTC)

Classical vs Modern Guarani
Would anyone please be kind enough to fix this? Paraguayan Guarani (gug) is a descendant of Classical Guarani (gn-cls). However, the latter doesn't appear as ancestor. It's the only thing stopping me from adding thousands of lemmas (when trying to use an 'inh' template it gives me an error). Thanks in advance! ~ 𝔪𝔢𝔪𝔬 (talk) 23:09, 10 November 2023 (UTC)