Wiktionary:Todo/Lists/technical documentation

This page contains technical documentation explaining how the Todo Lists project works.

Toolforge
The Todo Lists project runs on Toolforge. The tool name is.

If you are a maintainer of the project, you can manage the tool by logging into a Toolforge shell using  (see the quick start guide) and typing. All shell commands on this page will only work if you have "become" the tool account.

Updating from Git
The code for the Todo Lists project lives at https://gitlab.wikimedia.org/toolforge-repos/wikt-todo.

On Toolforge, there is a copy of this repo in the  directory under the   tool account's home directory.

If new commits have been made to the repo, you can update the copy of the repo on Toolforge by running:

cd src
git reset --hard  # erases local changes from any mucking around or testing
git pull origin

Scheduled todo list runs
Automatic generation of the SQL and custom todo lists is currently scheduled to take place every week, at a time determined by Toolforge. The XML dump todo list script is scheduled to run every day, but the script does nothing unless a new dump file is identified.
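
The "does nothing unless a new dump file is identified" behaviour can be pictured with a short sketch. This is not the project's actual code: the function names, the marker file and the `*.xml.bz2` naming pattern are all assumptions made for illustration.

```python
from pathlib import Path
from typing import Optional


def find_unprocessed_dump(dump_dir: Path, marker_file: Path) -> Optional[Path]:
    """Return the newest dump file if it has not been processed yet, else None."""
    # Assumption: dump file names contain a date stamp, so they sort chronologically.
    dumps = sorted(dump_dir.glob("*.xml.bz2"))
    if not dumps:
        return None
    latest = dumps[-1]
    # Hypothetical marker file holding the name of the last dump that was processed.
    last_done = marker_file.read_text().strip() if marker_file.exists() else ""
    return None if latest.name == last_done else latest


def mark_processed(dump: Path, marker_file: Path) -> None:
    """Record the dump name so that later daily runs become no-ops."""
    marker_file.write_text(dump.name + "\n")
```

With a check like this, a daily job can exit immediately on most days and only do the expensive work when a new dump appears.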

Job scheduling is defined in the  file. The format is documented here.

If you make changes to the  file, you must reload it using:

toolforge-jobs load src/jobs.yaml

Failure emails
If you get a failure email regarding a scheduled tool run, inspect the  log file in the tool's directory.

"Killed" failure
If the lists are silently failing to be generated, and upon inspecting the  log file you see the word "Killed", it's likely that the job exceeded the memory limit imposed by Kubernetes. You can verify this by looking at the memory graphs at the Grafana dashboard.

The default per-job memory limit is 512 MiB. This should be more than sufficient for most purposes. If the job is running out of memory, there might be a buggy script that is trying to include almost every page on Wiktionary in its result set. You can run a job manually with increased memory by appending, say,  (for 3 GiB) to the   command line below. This should allow the job to complete and you will hopefully be able to work out what is going wrong by looking at the resulting todo list.

The "Update now" system
SQL-based todo list pages have an "Update now" button. When clicked, the user is taken to. This page is served by the Toolforge web service documented at wikitech:Help:Toolforge/Web/Python.

The web server code is in. This is symlinked from, the mandatory location for Python web applications. Private configuration parameters are in.

If the website goes down, the web service can be started or restarted using:

webservice restart

Logs for the web service itself are in, while logs for the generate-lists-* script itself are in.

Running a todo list ad hoc from the server
To run a single todo list on an ad hoc basis, run the following command, where  is one of ,   or  , and   is the exact name of the desired todo list:

toolforge-jobs run mytodo --command "~/pyvenv/bin/python src/generate_lists_type.py 'Todo list name'" --image python3.11

Keep an eye on the status of the job using:

toolforge-jobs list

Once the job finishes, it will no longer be present in the list. Any error output will be saved to  in the tool account's home directory - run   to view it. Regular print statement output will be saved to. Consider deleting these output files once done to keep the tool directory clean.

Types of todo lists
Currently there are two ways to generate todo lists: SQL and XML dump. Two further types are planned: HTML dump and custom.

SQL
SQL-based todo lists are generated by simply running an SQL query against Wiktionary's database (or, more precisely, the read-only database replicas available through Toolforge) and formatting the output.

SQL is the best option for any todo list that does not require analysis of page content (wikitext). Anything relating to the following can be achieved using pure SQL:
 * basic page metadata as found at Special:PageInfo
 * links between pages as found at Special:WhatLinksHere
 * category structures and categorisation
 * whether or not a page uses a certain template (but not the template parameters or where on the page the template is used)
 * page histories
 * log entries as found at Special:Log
 * or any combination of these

SQL queries run very quickly if written well. Most of the SQL-based todo lists take just a few seconds to generate. However, it can be challenging to write an SQL query that is both correct and fast, especially if you do not have much experience with SQL. The MediaWiki relational database structure diagram and list of special views on Toolforge are essential references.


 * List definitions: SQL todo lists are defined in . To define a new SQL todo list, simply add a new entry to the   dictionary.
 * Testing: When developing a new query, use the MariaDB SQL console on Toolforge to test it.
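
To give a feel for the shape of such a query, here is a minimal sketch that runs against an in-memory SQLite table instead of the real replicas. The column names come from MediaWiki's `page` table, but the rows, the "short pages" query and the column aliases `Page_NSTITLE` and `Length_RAW` are made up for illustration (the aliases follow the convention described under "Column formatting codes"). SQLite joins strings with `||`; on MariaDB the same value would normally be built with `CONCAT()`.

```python
import sqlite3

# Stand-in for the wiki replica: a tiny in-memory table using real column
# names from MediaWiki's `page` table. The rows are invented test data.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute(
    "CREATE TABLE page (page_namespace INT, page_title TEXT, "
    "page_len INT, page_is_redirect INT)"
)
conn.executemany(
    "INSERT INTO page VALUES (?, ?, ?, ?)",
    [
        (0, "cat", 5200, 0),
        (0, "stub_entry", 12, 0),
        (10, "short_template", 8, 0),
        (0, "old_redirect", 20, 1),
    ],
)

# Hypothetical todo list: very short pages that are not redirects.
QUERY = """
SELECT page_namespace || '|' || page_title AS Page_NSTITLE,
       page_len AS Length_RAW
FROM page
WHERE page_is_redirect = 0 AND page_len < 50
ORDER BY page_len
"""

# Convert the result set into a list of dictionaries, one per row.
rows = [dict(r) for r in conn.execute(QUERY)]
print(rows)
```

The list of dictionaries at the end has the same shape as the todo list output described later on this page.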

XML dump
XML dump-based todo lists are generated by iterating through every line of wikitext (page source code) on every page of the latest Wiktionary XML dump and running Python code to compile the resulting todo list.

These todo lists are a good choice for detecting common misuses of templates or wikitext formatting. The XML dump contains the wikitext of the latest revision of each page, along with the page ID, namespace and title, but little else.
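
For orientation, the following sketch shows one way to stream pages out of a compressed dump using only the Python standard library. It is not the project's abstraction layer; `iter_pages` is a hypothetical name, and the sketch assumes the usual MediaWiki export layout of `page` elements containing `title`, `ns`, `id` and `revision/text`.

```python
import bz2
import io
import xml.etree.ElementTree as ET
from typing import BinaryIO, Iterator, Tuple


def iter_pages(stream: BinaryIO) -> Iterator[Tuple[int, int, str, str]]:
    """Yield (page_id, namespace, title, wikitext) for each page in the dump."""
    for _, elem in ET.iterparse(stream, events=("end",)):
        # Tags carry the export schema namespace, e.g. "{http://...}page".
        if elem.tag.rsplit("}", 1)[-1] != "page":
            continue
        # "{*}" matches the tag in any XML namespace (Python 3.8+).
        page_id = int(elem.findtext("{*}id"))
        namespace = int(elem.findtext("{*}ns"))
        title = elem.findtext("{*}title")
        text = elem.findtext("{*}revision/{*}text") or ""
        yield page_id, namespace, title, text
        elem.clear()  # release the page's subtree to keep memory use flat
```

A dump file would be opened with something like `bz2.open("/path/to/xml-dump.bz2", "rb")` and passed in as `stream`.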

An abstraction layer is provided so that the Python code for each todo list does not need to parse XML. For convenience, the abstraction layer keeps track of a hierarchy of section headings and, in some cases, supplies a parsed version of the line of wikitext alongside the wikitext itself. The Python script is free to run SQL queries or API requests as required.
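
The idea of tracking a hierarchy of section headings can be sketched as follows. This is only an illustration of the technique, not the project's implementation; `lines_with_headings` and the regular expression are assumptions.

```python
import re
from typing import Iterator, List, Tuple

# Matches wikitext headings such as "==English==" or "===Noun===".
HEADING_RE = re.compile(r"^(={1,6})\s*(.*?)\s*\1\s*$")


def lines_with_headings(wikitext: str) -> Iterator[Tuple[List[str], str]]:
    """Yield (heading_path, line) for every non-heading line of a page."""
    path: List[Tuple[int, str]] = []  # stack of (level, heading title)
    for line in wikitext.splitlines():
        match = HEADING_RE.match(line)
        if match:
            level = len(match.group(1))
            # Leave any sections at the same or a deeper level.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2)))
            continue
        yield [title for _, title in path], line
```

Each line then arrives with its context, for example `(["English", "Noun"], "# a feline")`, so a todo list can restrict its checks to particular languages or parts of speech.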

 * List definitions: XML dump todo lists are defined as Python classes in the  directory. To define a new todo list, make a copy of the   file in the relevant directory and write your code.
 * Testing: You can run an XML dump todo list on your local machine. Download the  or   XML dump and run the script locally:

python3 generate_lists_xmldump.py 'Todo list name' --dry-run --file /path/to/xml-dump.bz2  # or .xml

Future types
The Wikimedia Enterprise HTML dumps are convenient for certain kinds of analysis where the fully rendered page is required. These dumps, which contain page wikitext and categorisation information alongside the page HTML, are a superset of the XML dump for main namespace pages. However, Wikimedia Enterprise only generates HTML dumps for select namespaces; our custom namespaces, such as Reconstruction, are not included.

Custom todo lists would involve simply running a Python script that returns a table of results. In practice, these todo lists would typically run an SQL query, then perform some kind of post-processing on the query results to generate the todo list.

Todo list output
The output of each todo list is a list of dictionaries. The SQL generator converts the SQL query result set into this format. For Python-based todo lists, the code must return a list of dictionaries, where every dictionary in the list has exactly the same keys. The column names/key names need to adhere to a special format, as explained below.

This data is converted to wikitext, using a sortable table format if more than one key is present in the dictionary besides the optional, or a bulleted list format otherwise.
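
The choice between the two output formats can be sketched roughly as below. This is a simplified illustration under stated assumptions: `rows_to_wikitext` and its `section_key` parameter are hypothetical names, the formatting codes are not applied, and the real generator may emit different table markup.

```python
from typing import Dict, List, Optional


def rows_to_wikitext(rows: List[Dict[str, object]],
                     section_key: Optional[str] = None) -> str:
    """Render rows as a sortable table, or as a bulleted list for one column."""
    if not rows:
        return ""
    keys = list(rows[0])
    if any(set(row) != set(keys) for row in rows):
        raise ValueError("every row must have exactly the same keys")
    # The optional section key does not count as a displayed column.
    columns = [k for k in keys if k != section_key]
    if not columns:
        raise ValueError("at least one displayed column is required")
    if len(columns) > 1:
        out = ['{| class="wikitable sortable"', "! " + " !! ".join(columns)]
        for row in rows:
            out.append("|-")
            out.append("| " + " || ".join(str(row[c]) for c in columns))
        out.append("|}")
    else:
        out = ["* " + str(row[columns[0]]) for row in rows]
    return "\n".join(out)
```

A single displayed column yields one bullet per row; two or more yield one table row per dictionary.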

Section headings
The todo list output can optionally be divided into sections (L4 headers). This is achieved by adding a special  column (key) to the output.

It is critical that the output is sorted first by this section heading key! Otherwise the section headings will be uselessly repeated in random parts of the list.
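
The effect of sorting can be seen in a small sketch. The key name `section` below is a placeholder chosen for this example, not necessarily the real key name, and `split_into_sections` is a hypothetical helper.

```python
from itertools import groupby
from operator import itemgetter

SECTION_KEY = "section"  # placeholder key name, assumed for this sketch


def split_into_sections(rows, section_key=SECTION_KEY):
    """Return [(heading, rows_in_section), ...] after sorting by the section key."""
    # sorted() is stable, so rows keep their relative order within a section.
    ordered = sorted(rows, key=itemgetter(section_key))
    return [
        (heading, list(group))
        for heading, group in groupby(ordered, key=itemgetter(section_key))
    ]


rows = [
    {"section": "B", "x": 1},
    {"section": "A", "x": 2},
    {"section": "B", "x": 3},
]
# Grouping consecutive rows without sorting repeats the "B" heading.
unsorted_headings = [h for h, _ in groupby(rows, key=itemgetter("section"))]
sorted_headings = [h for h, _ in split_into_sections(rows)]
print(unsorted_headings, sorted_headings)
```

In an SQL todo list, the equivalent step is an `ORDER BY` clause that names the section column first.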

Column formatting codes
Every column name (key name) except  must contain at least one formatting code. These are written in ALL CAPS and placed after the displayed column name, set off by underscores, for example,  or.

For convenience in SQL queries, underscores will be replaced by spaces in the column name itself, so  would work too.
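
As an illustration of the naming scheme, the sketch below splits a column name into its displayed name and its formatting codes. It is an assumption-laden example rather than the project's parser: `split_column_name` is a hypothetical function, and `Page_title_NSTITLE_EDITLINK` is an invented column name.

```python
from typing import List, Tuple

# The formatting codes summarised on this page.
KNOWN_CODES = {
    "PAGE", "NSTITLE", "TALKLINK", "EDITLINK", "HISTLINK",
    "RAW", "NOWIKI", "CODE", "CODE50", "CODE100",
}


def split_column_name(name: str) -> Tuple[str, List[str]]:
    """Split e.g. 'Page_title_NSTITLE_EDITLINK' into ('Page title', [codes])."""
    parts = name.split("_")
    codes: List[str] = []
    # Peel recognised codes off the end of the name, preserving their order.
    while parts and parts[-1] in KNOWN_CODES:
        codes.insert(0, parts.pop())
    if not codes:
        raise ValueError(f"column {name!r} has no formatting code")
    # Remaining underscores become spaces in the displayed name.
    return " ".join(parts), codes
```

Note that this simple approach would misread a displayed name that itself ends in an all-caps word matching a code.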

The formatting codes are defined in the  function in. Here is a summary:


 * PAGE: Formats the data as a regular link to a Wiktionary page, including the namespace name (if any). This formatting code is intended for XML dump scripts, where the  parameter already includes the namespace name.
 * NSTITLE: Formats the data as a link to a Wiktionary page. The column should contain values of the form,   or  , made up of the namespace number, a pipe character, and the page title (either underscores or spaces are fine). This formatting code is intended for SQL queries, where namespace names are not available. Generate an NSTITLE value using  or equivalent.


 * TALKLINK: Can be included after PAGE or NSTITLE to add a "talk" link. The link is only added for pages in non-talk (even-numbered) namespaces.
 * EDITLINK: Can be included after PAGE or NSTITLE to add an "edit" link.
 * HISTLINK: Can be included after PAGE or NSTITLE to add a "history" link.
 * RAW: Raw wikitext. Use this if you know the output won't contain special wikitext characters (a simple number for example), or if you need to substitute magic words, like subst:NS:..., or templates, like ..., into the output.
 * NOWIKI: Wrap in  tags.
 * CODE: Wrap in  tags.
 * CODE50: Wrap in  tags, keeping only the first 50 characters and adding ... if more characters are present.
 * CODE100: Wrap in  tags, keeping only the first 100 characters and adding ... if more characters are present.
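
A few of the rules above are easy to express in code. The helpers below are hypothetical and exist only to make the rules concrete: the NSTITLE value layout, the even-numbered namespace test used for TALKLINK, and the truncation used by CODE50 and CODE100.

```python
from typing import Tuple


def make_nstitle(namespace: int, title: str) -> str:
    """Build an NSTITLE value: namespace number, a pipe, then the page title."""
    return f"{namespace}|{title}"


def parse_nstitle(value: str) -> Tuple[int, str]:
    """Split an NSTITLE value; underscores and spaces in the title are equivalent."""
    namespace, _, title = value.partition("|")
    return int(namespace), title.replace("_", " ")


def has_talk_link(namespace: int) -> bool:
    """Talk links are only added for non-talk (even-numbered) namespaces."""
    return namespace % 2 == 0


def truncate(text: str, limit: int) -> str:
    """Keep the first `limit` characters, adding '...' if any were dropped."""
    return text if len(text) <= limit else text[:limit] + "..."
```

For example, `truncate(value, 50)` mirrors the CODE50 rule and `truncate(value, 100)` the CODE100 rule, before the value is wrapped in its tags.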