User talk:Jberkel/lists/wanted

Catalan
Can you include Catalan next month? Ultimateria (talk) 16:29, 18 June 2019 (UTC)
 * Sure, will do. Which reminds me, I wanted to ask you about Catalan resources, can you recommend anything? I'm looking for interesting reading material (magazine format, science, arts etc.) and podcasts. – Jberkel 16:49, 18 June 2019 (UTC)
 * I don't know about podcasts (my comprehension of spoken Catalan is quite low), but I can wholeheartedly recommend the magazine Time Out Barcelona. It's a great way to pick up informal and creative Catalan words. (I used to read these every week when I lived in Barcelona, highlighting unknown words. I only realized today that you can download them!) This website has tons of open-access academic articles, but if you want something more casual for science, check the science section at ara.cat. Some interesting headlines there. I used to read the culture blog Atzucac, but they haven't published anything since 2017. It's worth looking through the archive though. Ultimateria (talk) 17:14, 18 June 2019 (UTC)
 * Can Slovene be added as well? Also, perhaps an improvement but it would need more work: to group the links by where they appear. Groupings could be translations, the etymology of the language in question, derived/related terms of the language in question, and others. —Rua (mew) 13:25, 29 June 2019 (UTC)
 * Sure, I'll add Slovene to the list. Would grouping really be that helpful? Many entries/languages have only around 10 incoming links max (with some outliers, like Russian). – Jberkel 16:24, 29 June 2019 (UTC)
 * I was hoping to create entries for all Slovene terms appearing in Slovene etymologies first, so that etymologies only have blue links. For that, it would be helpful if there was a separate listing of all red/orange links to Slovene terms that appear in Slovene etymologies. —Rua (mew) 10:06, 30 June 2019 (UTC)
 * Slovene has now been added. There are a couple of bluelinks in there, probably an issue with diacritic filtering. Regarding the grouping, I misunderstood, you want a separate table per link source? It would be nice to have everything in one table, but I don't know if table filtering can be done easily. – Jberkel 00:34, 5 July 2019 (UTC)

Latest lists
I'm planning to link the latest Ancient Greek list from About Ancient Greek, but the problem is that the page name changes as each dump comes out. Could you have your script update "latest" pages, like User:Jberkel/lists/wanted/latest/grc, so that they redirect to the most recent dated page (User:Jberkel/lists/wanted/20190801/grc currently)? — Eru·tuon 18:06, 17 August 2019 (UTC)
 * Ok, done. The redirects need to be created for each language, right? There's no way to just redirect YYYYMMDD -> latest? – Jberkel 20:38, 17 August 2019 (UTC)
 * Do you mean the reverse, to use template syntax or something to automatically redirect the "latest" pages to the correct YYYYMMDD page by comparing the current date to the days of the month when the pages-articles.xml is finished generating? I tried, and the redirect syntax apparently requires a literal link. — Eru·tuon 22:33, 17 August 2019 (UTC)
 * Interesting idea, but no, I meant a single redirect from the timestamp to "latest" (like a link on a filesystem), but I realized that's not how redirects work on MediaWiki. – Jberkel 05:55, 18 August 2019 (UTC)
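Since each language needs its own literal redirect, the "latest" pages could be generated along these lines. This is only a minimal sketch, not the actual script: the function name `latest_redirects` and its parameters are hypothetical, and only the page-name pattern and the `#REDIRECT [[...]]` syntax are taken from the discussion above and from MediaWiki.

```python
def latest_redirects(dump_date, lang_codes, base="User:Jberkel/lists/wanted"):
    """Map each 'latest' page title to the redirect wikitext it should contain."""
    pages = {}
    for code in lang_codes:
        target = f"{base}/{dump_date}/{code}"
        # MediaWiki redirects need a literal target, so one page per language.
        pages[f"{base}/latest/{code}"] = f"#REDIRECT [[{target}]]"
    return pages
```

For example, `latest_redirects("20190801", ["grc"])` maps `User:Jberkel/lists/wanted/latest/grc` to `#REDIRECT [[User:Jberkel/lists/wanted/20190801/grc]]`. Writing the pages to the wiki is left out of the sketch.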

List of languages that get their own lists
To allow people to request lists for a particular language, perhaps there should be a page that specifies which languages get their own lists. Then people can add the language code or language name for languages that they are interested in, and the script can read it and generate pages for those languages when the lists are updated. (For instance, Arabic should probably be included, and it's got more links than Italian.) I was adding a link to these lists in WT:NEWS and it occurred to me that people would probably want to be able to request lists for more languages, since downloading the master list and filtering it requires technical knowledge. Perhaps it could be located at User:Jberkel/lists/wanted/languages, unless there are other settings that people should be able to change. — Eru·tuon 21:26, 10 October 2019 (UTC)
 * A page with a list of languages is good, the script could just fetch it before the run and configure itself. I'm not sure what is up with Arabic, it should really be in there. I'll have a look. – Jberkel
 * I went ahead and created User:Jberkel/lists/wanted/languages, it will be used on the next run. – Jberkel 10:19, 21 October 2019 (UTC)
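Reading the language page before a run might look roughly like the sketch below. The page format is an assumption (one language code per line, optionally as a bulleted list), and `parse_language_list` is a hypothetical name; fetching the wikitext of User:Jberkel/lists/wanted/languages is not shown.

```python
import re

def parse_language_list(wikitext):
    """Extract language codes from a page with one code per line."""
    codes = []
    for line in wikitext.splitlines():
        # Drop leading list markers and surrounding whitespace.
        entry = line.lstrip("*# ").strip()
        # Accept only things that look like language codes, e.g. "ca", "grc", "gem-pro".
        if re.fullmatch(r"[a-z]{2,3}(-[a-z]{2,3})*", entry) and entry not in codes:
            codes.append(entry)
    return codes
```

Lines that do not look like a code (comments, blank lines, headings) are skipped rather than treated as errors, so that people can annotate the page freely.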

Spurious wanted items
There are a large number of English items that appear as affixes but are actually just words used as bases of morphologically derived terms, e.g. -Raphael rather than Raphael. The problem seems to be the reading of Raphael as used in. For some reason a spurious "-" is added. DCDuring (talk) 21:11, 18 October 2019 (UTC)
 * For Jberkel (though this might be obvious), the culprit is . It seems to treat the first parameter as the prefix and the second parameter as the suffix. But it's the third parameter that should be the suffix. — Eru·tuon 21:34, 18 October 2019 (UTC)
 * Thanks, I'll fix it. So it's always the last parameter in the list which is to be treated as suffix. – Jberkel 22:07, 18 October 2019 (UTC)
 * Usually it's the last numbered parameter because there are only three or four numbered parameters, but there are four cases of with more than four numbered parameters in the last dump, all of them with an empty fifth parameter. I should have said, the prefix is the second parameter and the suffix is the fourth parameter if present, otherwise the third parameter. If there are more numbered parameters, they are simply ignored, even when they're not empty: . — Eru·tuon 22:27, 18 October 2019 (UTC)
 * Ok. I often checked the documentation pages for guidance, and the example given on just uses two parameters, probably an atypical usage example. – Jberkel 05:08, 19 October 2019 (UTC)
 * Ahh, I see. I've added an example with three parts for clarity. Actually, two parts is more common: 11,925 cases versus 3,695 with three. — Eru·tuon 05:23, 19 October 2019 (UTC)
 * This is fixed in the code, the next version of the lists shouldn't include these items. – Jberkel 09:04, 19 October 2019 (UTC)
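The parameter rule described in this thread can be sketched as follows. This is an illustration, not the code that was actually fixed: `affix_links` is a hypothetical name, and treating an empty fourth parameter as absent and leaving already-hyphenated parameters untouched are both assumptions.

```python
def affix_links(numbered_params):
    """Return (prefix, base, suffix) link targets for a prefix+suffix template call.

    numbered_params[0] is the first numbered parameter (the language code).
    """
    # Only the first four numbered parameters matter; later ones are ignored.
    args = [p.strip() for p in numbered_params[:4]]
    args += [""] * (4 - len(args))
    _lang, prefix, third, fourth = args

    if fourth:
        base, suffix = third, fourth      # lang | prefix | base | suffix
    else:
        base, suffix = "", third          # lang | prefix | suffix

    def hyphenate(term, template):
        # Assumption: add the affix hyphen only when the parameter lacks one.
        if not term:
            return None
        return term if "-" in (term[0], term[-1]) else template.format(term)

    return (hyphenate(prefix, "{}-"), base or None, hyphenate(suffix, "-{}"))
```

With three parts, `affix_links(["en", "pre", "Raphael", "ite"])` returns `("pre-", "Raphael", "-ite")`, so the base is no longer read as a suffix. With two parts, `affix_links(["en", "a", "b"])` returns `("a-", None, "-b")`.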

Template redirects
It looks like the list of templates is missing some redirects (Template:suf, Template:pre, Template:con). Ideally all the names of redirects would be retrieved at the time when the dump comes out. (Or maybe they could be gleaned from the dump.) My Rust program has a subcommand to do that, but I imagine it would be hard to integrate its output into the Java source code. The output is also in a specific format used by the template-dumping subcommand.

(Also, crosslink to BP discussion involving template names.) — Eru·tuon 22:50, 10 November 2019 (UTC)
 * Ah, good catch. The index actually has redirect information (which is needed for checking the existence of entries), it's just not used for templates. – Jberkel 17:14, 12 November 2019 (UTC)
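Applying the redirect information to template names could be as simple as the sketch below. The `redirects` mapping stands in for whatever the index provides; its shape (redirect name to target name) and the function name `canonical_template` are assumptions.

```python
def canonical_template(name, redirects):
    """Follow redirect entries until a non-redirect template name is reached."""
    seen = set()
    while name in redirects and name not in seen:
        seen.add(name)  # guard against redirect loops
        name = redirects[name]
    return name
```

With a sample mapping such as `{"suf": "suffix", "pre": "prefix", "con": "confix"}` (targets given for illustration), `canonical_template("suf", ...)` returns `"suffix"`, and names that are not redirects are returned unchanged.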

Normal operation resumed
I finally migrated the project to Spark 3 and now it runs again. I'll have a look at fixing some of the open issues in the next weeks. – Jberkel 22:20, 15 November 2020 (UTC)

New HTML-based run (20220301)
The lists are now based on an HTML dump and are therefore more complete than previous versions. However, at the moment only mainspace is taken into account, and it looks like there are some inconsistencies regarding redirects. You'll also notice that some lists are now noisier (English, for example, now includes many place names, which were ignored in previous runs). The raw data is also missing; I'll include it in the next run. – Jberkel 09:47, 6 March 2022 (UTC)

HTML dump bugs
Unfortunately the HTML "enterprise" dumps seem a bit flaky and incomplete: some pages and categories are missing or not updated, and entire wiki-specific namespaces such as  are not included. – Jberkel 22:19, 5 April 2022 (UTC)
 * T303652 (Include more namespaces in Wiktionary HTML dumps)
 * T305407 (Stale data / missing pages in HTML ("enterprise") dumps)
 * T300124 (In Wikimedia Enterprise HTML Dumps, categories and templates are not always extracted)

Grid Engine migration
Grid Engine is now deprecated; I need to migrate the list generation to Kubernetes. In the meantime, no updates. Also, the last two enterprisey dumps have not been generated, see T311441. —Jberkel 17:00, 11 July 2022 (UTC)
 * Update: now runs on the new infrastructure, and the enterprisey dumps are back (but still borked). – Jberkel 07:32, 12 September 2022 (UTC)

Reducing the utility of incoming links sort order
There are two kinds of phenomena that reduce the value of the incoming links search order in Cirrus search. One is the existence, for entries containing few characters, of numerous inbound links from anagrams and from. Another is the numerous inbound links from user pages. For this last I am a black kettle and need to, at least, shut down links from my old "wanted taxa" pages.

Also, anything you could do to reduce the misprioritizing of terms on your "wanted" lists (due to the anagram and links) would be useful. DCDuring (talk) 21:59, 4 September 2022 (UTC)


 * Ok, it's easy to ignore incoming links from certain sections (anagrams, see also, user namespace). Will try to include it in the next run. Jberkel 15:40, 6 September 2022 (UTC)
 * Looks like anagram links should already be excluded, maybe it's not working properly. Can you point me to some entries where this is not the case? – Jberkel 07:24, 12 September 2022 (UTC)
 * Thanks. I'm glad that the changes are easy. I'd hope they would be.
 * I will let you know about problematic links from anagrams when I find them. I may have erred in my complaint about anagrams, simply assuming they were a problem for your lists because they remain a problem for general search.
 * Do you know whether there is a way for me to have the WM search engine(s) ignore incoming links from such locales when I do a search from the search box?
 * I've disabled links from some of my user pages. I may start removing blue-linked items from the still-active user pages as well to reduce the self-inflicted part of my search problem. DCDuring (talk) 14:37, 12 September 2022 (UTC)
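Ignoring incoming links from certain sections and namespaces, as discussed above, could be sketched like this. The data shape is hypothetical (each link as a `(source_title, section, target)` tuple), as are the names `count_wanted`, `IGNORED_SECTIONS` and `IGNORED_NAMESPACES`; the real pipeline may represent links quite differently.

```python
from collections import Counter

IGNORED_SECTIONS = {"Anagrams", "See also"}
IGNORED_NAMESPACES = {"User"}

def count_wanted(links):
    """Count incoming links per target, skipping ignored sections and namespaces.

    Each link is a (source_title, section, target) tuple.
    """
    counts = Counter()
    for source, section, target in links:
        # Treat the text before the first colon as the namespace, if any.
        namespace = source.split(":", 1)[0] if ":" in source else ""
        if section in IGNORED_SECTIONS or namespace in IGNORED_NAMESPACES:
            continue
        counts[target] += 1
    return counts
```

Sorting the resulting counter with `most_common()` would then give a ranking that is not inflated by anagram sections or user pages.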

Update
@Jberkel Could you update the lists from the latest dump (20/10/2023)? Thanks! — Fenakhay ( حيطي · مساهماتي ) 16:12, 20 October 2023 (UTC)


 * Sadly, the HTML dumps have been broken/incomplete for a while now. I've tried to poke the WMF folks but no reaction so far, except vague promises of "Investigation". See T345176. Jberkel 17:02, 20 October 2023 (UTC)
 * Dumps are usable again (only took a year, WMF record), lists have been regenerated. Jberkel 09:27, 6 June 2024 (UTC)