User:AutoFormat

Owner/runner: User:Robert Ullmann

Source: User:AutoFormat/code

Usage
To queue an entry for AutoFormat, add the tag {{rfc-auto}}, exactly that way with no parameters. It can be placed anywhere reasonable. The page will appear in Category:Requests for autoformat.

The bot reads Recent changes and checks each main namespace entry that has been patrolled. It waits 15-20 minutes before the check, sometimes longer when busy, to limit traffic.
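
The waiting period can be pictured as a small delay queue. This is only a sketch, not the bot's actual code: the class name, the 15-minute default, and the choice to keep the first due time when a title is seen twice are all assumptions made for illustration.

```python
import heapq


class DelayQueue:
    """Hold page titles until a minimum delay has passed (illustrative only)."""

    def __init__(self, delay=15 * 60):
        self.delay = delay      # seconds to wait before checking a page
        self._heap = []         # (due_time, title), earliest first
        self._queued = set()    # titles currently waiting

    def add(self, title, now):
        # Assumption: a title already waiting keeps its original due time.
        if title not in self._queued:
            self._queued.add(title)
            heapq.heappush(self._heap, (now + self.delay, title))

    def pop_due(self, now):
        """Return every title whose waiting period has elapsed."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            _, title = heapq.heappop(self._heap)
            self._queued.discard(title)
            due.append(title)
        return due
```

Passing the clock value in as `now` keeps the sketch easy to test; a real runner would presumably feed it the current time on each pass over Recent changes.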

When otherwise idle, the bot picks up an entry at longish intervals from a prescreen of a recent XML dump, based on some simple tests, and then checks the current entry.

Principles
The essential idea is based on an observation: while the English Wiktionary has, as it must, a fairly rigid format, it is pointless and unproductive to go about complaining to users (both new and old) about fiddly details.

If one has to snap at some newbie for writing "Related Terms" instead of "Related terms", or for forgetting the odd horizontal rule between language sections, there is less time for more productive work, and the newbie may (will) find the requirements unfriendly. In the case of the contributor who has been around for a very long time and writes ===noun=== it is perhaps even less productive to engage in a talk page conversation.

There are also things people persist in doing even when they know better: using PAGENAME and forgetting to subst it; using because they are used to it. (And then there is "Pronounciation" ...) These are better fixed than complained about.

The more controversial class of problems is when someone writes an entry using a straight line of level 3 headers, or doesn't know how to really nest the etymology sections, but the structure is well-determined. This is very common, as is (oddly enough) starting off with Etymology or Pronunciation at level 4 (which appears to work, because the WM software displays the TOC correctly in spite of that!)

One observation from the initial testing is that involved, serious users notice the changes made by AutoFormat, probably from their watchlists, and learn from them.


 * AutoFormat is not a policy instrument

It fixes things that are errors or common misunderstandings of the standard format; it is not intended to enforce policy. Anything controversial or being debated is out of scope. Any 'bot action to implement a new policy should be done by a purpose-built bot running from the XML dump, since it should be fixing the entire wikt. An example would be changing "Scots Gaelic" to "Scottish Gaelic". (Although much later AutoFormat might routinely canonicalize the language name.)


 * AutoFormat fixes syntax, not semantics

Blank lines, fiddly spacing in headers, spelling errors in headers, etc, etc. But not (for example) trying to convert a non-standard "Transitive verb" header to "Verb" with the definition line(s) tagged with, this would be asking for semantic errors. AutoFormat tags such cases for attention.

Bot actions
Note that when the bot is operating autonomously, that is, not on pages flagged explicitly for rfc-auto, it does not make an edit/save page just for minor spacing. If it makes any other changes, all of the minor spacing etc. is done. If the entry is new, it will apply the minor spacing and related changes, such as changing "category" to "Category".

Language sections
Sorts language sections into canonical order, prolog code at the top ( template, etc), all iwikis at end. See User:AutoFormat/Languages for the control table.


 * L2 headers with no extra spaces around language name
 * no extra stuff on line after ==
 * L1 headers treated as (converted to) L2 (may be wrong, but not evil)
 * imbalance (==lang===) fixed
 * wikilinking stripped in L2 headers in WT:TOP40
 * blank line and divider before L2 sections other than the first (technically part of previous section)
 * no blank line after prolog
 * one blank line before iwikis
 * no sort or check of iwikis (see User:Interwicket)
 * subst language code templates in L2 headers
 * flag known bad L2 sections (See also, Etymology) with
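
A few of the L2 header rules above (extra spaces, L1 treated as L2, imbalance, unlinking common languages) can be sketched roughly as follows. This is an illustration, not the bot's code: the function name is made up, the `unlink` set is a tiny stand-in for WT:TOP40, and text after the closing == is not handled.

```python
import re

HEADER_RE = re.compile(r"^(=+)\s*(.*?)\s*(=+)\s*$")
LINK_RE = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]|]*)\]\]")

# Stand-in for the WT:TOP40 list; the real list is maintained on-wiki.
TOP_LANGUAGES = frozenset({"English", "French", "German", "Spanish"})


def normalize_language_header(line, unlink=TOP_LANGUAGES):
    """Return a tidied L2 header, or the line unchanged if it is not L1/L2."""
    m = HEADER_RE.match(line)
    if not m:
        return line
    opening, name, _closing = m.groups()
    if len(opening) > 2:
        return line  # level 3 and deeper are not language headers
    # Strip the wikilink only for the common languages.
    link = LINK_RE.fullmatch(name)
    if link and link.group(1) in unlink:
        name = link.group(1)
    # L1 becomes L2, and an unbalanced closing run is rebuilt to match.
    return "==" + name + "=="
```

The opening run of = decides the level here, so `==German===` comes out as `==German==`; that tie-break is an assumption.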

Headers
Headers other than L2 language headers. Bot action is controlled by User:AutoFormat/Headers.


 * no extra spaces, anything after header moved to next line
 * imbalance fixed
 * blank line before header
 * correct common form errors
 * correct common misspellings with fuzzy match
 * correct level errors, tag uncorrected level errors with
 * level errors in multiple etymology case
 * tag unknown/non-standard headers with
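
The fuzzy correction of header spelling could look something like the sketch below, which uses the standard library's difflib. The header list is a small stand-in for the table at User:AutoFormat/Headers, and the 0.85 cutoff is an arbitrary choice for illustration; the bot's own matching may differ.

```python
import difflib

# Stand-in for the control table; the real list lives on-wiki.
KNOWN_HEADERS = [
    "Etymology", "Pronunciation", "Noun", "Verb", "Adjective",
    "Synonyms", "Derived terms", "Related terms", "Translations",
]


def correct_header(text, known=KNOWN_HEADERS, cutoff=0.85):
    """Return the standard spelling of a header, or None if nothing is close."""
    by_lower = {name.lower(): name for name in known}
    key = " ".join(text.split()).lower()   # collapse spacing, ignore case
    if key in by_lower:
        return by_lower[key]               # form error only, e.g. capitalization
    close = difflib.get_close_matches(key, list(by_lower), n=1, cutoff=cutoff)
    return by_lower[close[0]] if close else None
```

A return value of None corresponds to the "tag unknown/non-standard headers" case: the header is left for a human rather than guessed at.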

Categories
Categories are listed at the end of the language section, after a blank line.


 * category changed to Category, remove space before category name
 * move categories to correct language section
 * remove duplicates
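
As a rough sketch of the first and last rules (moving categories between language sections is not shown), assuming the category links of one language section have already been collected into a list:

```python
import re

CAT_RE = re.compile(
    r"\[\[\s*category\s*:\s*([^\]|]+?)\s*(\|[^\]]*)?\]\]", re.IGNORECASE
)


def tidy_categories(links):
    """Normalize 'category' to 'Category', trim spacing, drop duplicates."""
    seen = set()
    result = []
    for link in links:
        m = CAT_RE.fullmatch(link.strip())
        if m:
            name, sort_key = m.group(1), m.group(2) or ""
            link = "[[Category:" + name + sort_key + "]]"
        if link not in seen:          # keep the first occurrence only
            seen.add(link)
            result.append(link)
    return result
```

The function name and the list-in, list-out shape are illustrative choices, not taken from the bot's source.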

POS sections
In sections marked as POS, whether also NS or not.


 * no blank lines until after inflection/headword line
 * one blank line before first definition
 * adds inflection line if no inflection line/headword repeater (see below)
 * one space after # at the start of definitions
 * adds if no definition line in section (before subsections, which is the only place definitions should occur)
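
Two of these checks are simple enough to sketch: the single space after #, and the test for whether a POS section has any definition line at all. Both helpers are hypothetical and simplified; in particular, treating "#:" and "#*" lines as examples or quotations rather than definitions is an assumption.

```python
import re


def space_after_marker(line):
    """Put exactly one space after the leading #, #:, #* or ## marker."""
    m = re.match(r"(#[#*:]*)[ \t]*(.*)", line)
    if not m:
        return line                      # not a numbered-list line
    marker, rest = m.groups()
    return marker + " " + rest if rest else marker


def has_definition(section_lines):
    """True if any line is a definition: # not followed by : or *."""
    return any(re.match(r"#+(?![#:*])", line) for line in section_lines)
```

A section where someone used * instead of # for the definitions makes `has_definition` return False, which is the situation the bot flags.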

Translations sections

 * substitute language code templates
 * format lines "*(language): " and text
 * leaves "*:" lines alone for now
 * unlink Top 40
 * link unusual languages
 * convert 'pedia links to languages to wikt links
 * convert genders in single quotes to templates, converts deprecated templates
 * convert to  when gloss is present
 * sorts sections with and re-balances columns

Multiple etymologies

 * flag any Etymology header not at level 3 with rfc
 * except: at level 4 as the first header, change to 3 (quite common)
 * number multiple etymology sections correctly
 * un-number if just one
 * move POS sections within etymology sections to level 4+
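
The numbering rules can be sketched as below, assuming the lines of a single language section are passed in and that only level 3 Etymology headers are considered. The function name is invented for the example, and re-levelling of the POS sections is left out.

```python
import re

ETY_RE = re.compile(r"^===\s*Etymology(?:\s+\d+)?\s*===\s*$")


def renumber_etymologies(lines):
    """Number level 3 Etymology headers 1..n, or un-number a single one."""
    positions = [i for i, line in enumerate(lines) if ETY_RE.match(line)]
    result = list(lines)
    for number, index in enumerate(positions, start=1):
        if len(positions) > 1:
            result[index] = "===Etymology " + str(number) + "==="
        else:
            result[index] = "===Etymology==="
    return result
```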

Context labels
Replace context labels in parenthesis and italics with context templates. Control table is User:AutoFormat/Contexts.


 * replace (string) at the start of a definition line with the corresponding label template
 * add the lang= parameter when not English
 * no replacement if the language code is not available
 * remove corresponding topic category if present, unless explicit category link has a sort key
 * identify modifiers for context
 * multiple labels to
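
The first three rules might be sketched like this. The `table` argument stands in for User:AutoFormat/Contexts, mapping a label string to a template name; the mapping used in the usage line, and the exact template syntax produced, are assumptions for illustration only. Modifiers, multiple labels, and category removal are not covered.

```python
import re

LABEL_RE = re.compile(r"^#\s*(?:'')?\(([^()']+)\)(?:'')?\s*(.*)$")


def apply_context_label(line, lang_code, table):
    """Replace a leading (label) on a definition line with a template call.

    lang_code is None when the language code is not available, in which
    case nothing is replaced.
    """
    m = LABEL_RE.match(line)
    if not m or lang_code is None:
        return line
    template = table.get(m.group(1).strip().lower())
    if template is None:
        return line                      # label not in the control table
    params = "" if lang_code == "en" else "|lang=" + lang_code
    return ("# {{" + template + params + "}} " + m.group(2)).rstrip()
```

For example, with `table = {"slang": "slang"}`, the line `# ''(slang)'' Un chien.` in a French section would come out as `# {{slang|lang=fr}} Un chien.` under these assumptions.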

Other
Small things. Note that any of these replacements could be done by bot on the entire wikt, but the point here is that AF is mostly about fixing new contributions and edits; these are things that are commonly introduced. A number of editors routinely use "wikipediapar", and "Wikipedia" is common among people from the 'pedia who expect a template name to be capitalized (e.g. w:Template:Wiktionary). We have the redirect so they can do what they expect easily, and then we fix it. Likewise, people persist in using PAGENAME without subst'ing it.


 * subst:PAGENAME
 * replace with
 * replace with
 * replace with, similarly with Initialism and Abbreviation; these are properly lower case, but people tend to use the upper case form because headers should be upper case
 * replace with
 * replace with,  with

More details
This is a more detailed description of some of the above.

Tagging
AutoFormat adds tags to entries when they require further attention.


 * for a serious problem, known header (such as "See also"), occurring at level 2 or 1; AutoFormat makes no other changes
 * for an Etymology header not at level 3, except when it is at L4 and is the first header in the language section, in which case it is corrected (a fairly common case!)
 * for a header that cannot be corrected
 * for structure problems that are not fixed
 * for headers Transitive verb, Intransitive verb, and Reflexive verb
 * for headers of the form X phrase
 * for translations tables containing lines that cannot be parsed well enough to permit sorting

It only adds these if there is no existing cleanup tag; that is, no template starting with "rfc". In almost all cases, it will only add one, even if there are multiple problems. The rfc-header, rfc-level, rfc-trverb, and rfc-xphrase tags are quiet: they have no visible effect on the page except for the added categories.
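
The "only if there is no existing cleanup tag" rule can be sketched as a small decision function. It applies the stated test literally (any template whose name starts with "rfc"); the function name and the idea of passing candidate tags in priority order are assumptions.

```python
import re

# Any template whose name starts with "rfc" counts as an existing cleanup tag.
EXISTING_RFC_RE = re.compile(r"\{\{\s*rfc", re.IGNORECASE)


def tag_to_add(wikitext, candidate_tags):
    """Return the single tag to add, or None if no tag should be added.

    candidate_tags lists the tags for the problems found, most
    important first; at most one is ever returned.
    """
    if EXISTING_RFC_RE.search(wikitext):
        return None
    return candidate_tags[0] if candidate_tags else None
```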


 * is added when there appears to be no definition line in a POS section. Often this is a format problem, such as using * instead of # for a definition.

Inflection lines
When AutoFormat generates an inflection line, it writes it as  , this will usually add the correct categorization.

It also adds a few more specific things:


 * if language is English, and POS is Noun, Verb, Adverb, or Adjective, adds the appropriate template-needed category
 * if the page name is in Arabic script, uses sc=Arab
 * if the page name is in Han characters and language is Korean or Vietnamese, uses sc=Hant (traditional)
 * if Han and Japanese, adds sc=Jpan
 * else if Han adds sc=Hani
 * adds as needed
 * and similar for other scripts
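
The script choice can be sketched as a lookup on the page name and language. This is a simplification: it tests only the basic Arabic block (U+0600 to U+06FF) and the main CJK Unified Ideographs block (U+4E00 to U+9FFF), and it treats a title as being in a script if any character falls in the range. The function name is made up.

```python
def script_code(title, language):
    """Pick an sc= value for the headword line, or None if none is needed."""
    if any("\u0600" <= ch <= "\u06ff" for ch in title):
        return "Arab"
    if any("\u4e00" <= ch <= "\u9fff" for ch in title):
        if language in ("Korean", "Vietnamese"):
            return "Hant"    # traditional characters
        if language == "Japanese":
            return "Jpan"
        return "Hani"
    return None              # other scripts would be handled similarly
```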

These occur with remarkable frequency, as the use of headword repeaters for each and every POS is not intuitive.

Sorting translations sections
Some details:


 * only sections marked with (including those converted on the same edit)
 * balancing is based on the wikitext lines
 * nested sections (using *: for subsequent lines) are, of course, kept together, and counted correctly
 * languages within them (such as Chinese languages) are not sorted
 * is read correctly
 * ill-formatted lines, including comments by themselves, prevent sorting
 * the code guesses when a long line will wrap, and should be counted as two or more; it can't do this exactly because the rendering depends on details of template expansion, browser, platform, user fonts, and user window size ...
 * the vertical spacing of the lines varies a bit with the sub-sections (*: lines), some fonts, and uses
 * a comment immediately following will remain there
 * comments anywhere else on lines by themselves will prevent sorting (there is no way to know which language line they belong to)
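
Grouping, sorting, and balancing along these lines might look like the sketch below. It keeps "*:" lines with their parent language, refuses to sort when any line cannot be parsed (including a comment on its own line), and splits the result into two columns by raw line count. The guess about long lines wrapping is not attempted, and the function names are invented for the example.

```python
import re

ENTRY_RE = re.compile(r"^\*\s*([^:*]+?)\s*:")


def group_and_sort(lines):
    """Return [(language, [lines])] sorted by language, or None if unsortable."""
    groups = []
    for line in lines:
        if line.startswith("*:"):
            if not groups:
                return None              # nested line with no parent
            groups[-1][1].append(line)   # keep nested lines with their parent
            continue
        m = ENTRY_RE.match(line)
        if not m:
            return None                  # ill-formatted line or stray comment
        groups.append((m.group(1).strip("[]"), [line]))
    groups.sort(key=lambda g: g[0].lower())
    return groups


def balance(groups):
    """Split sorted groups into two columns of roughly equal line count."""
    total = sum(len(block) for _, block in groups)
    left, right, count = [], [], 0
    for _, block in groups:
        if count * 2 < total:
            left.extend(block)
            count += len(block)
        else:
            right.extend(block)
    return left, right
```

Because nested lines travel with their parent, languages inside a nested block (such as the Chinese languages) keep whatever order the editor gave them.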