User:AutoFormat/code

notice
This comes with several very important caveats:


 * I am a professional software engineer, and this is what I do; however, this code was written for my own use, is not warranted, and carries no implication of merchantability or fitness for use.


 * Like everything else in the Wiktionary, this is under the GFDL. The GFDL is not compatible with the GPL; this document is not licensed under the GPL as software. (!)


 * At any given moment, this code may not represent what is being run; I have no intention of updating this page every time I make a change.

technical notes
I don't have my ego attached to code I write; I routinely dump code that has gotten too complex, and re-write it. On the other hand, even if something is sloppy, if it is tested and works, I leave it alone.


 * Some of the comments may be snarky.
 * The comments are often (usually) written to remind me of something, not to explicate the code.
 * Since I modify this regularly, there is code that is not reached or otherwise redundant.
 * The pre-parsing should go deeper; a fairly major restructuring would be helpful at some point soon.
 * A small number of bugs known (to me ;-) are not yet fixed; I handle them by monitoring the edits done. (Handling multi-line comments is one example.)
 * The wikipedia.py module AF uses is heavily modified from the distro; however, the interface is the same. In the presence of network problems, failures, or outages, AF run with the distro version may abort where the modified version would have recovered. The exceptions thrown are the same, but they are raised under different conditions.
 * On Linux, the clock timing works, but will display ugly large values.
 * The code to handle headers is largely hacked to implement the "Connel" flag ....
 * The code that handles Etymology headers is based on the current WT:ELE; there is no problem changing it when we figure out how Etymology and Pronunciation are supposed to play nicely together in the general case.
 * AF must also have a sysop account in order to read patrolled flags in Recent Changes (RC); "enhanced" RC mode must be turned off.

outline

 * prescreen

Reads the XML dump, uses simple regexes to find entries that may need attention, and builds a random index.
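A minimal sketch of what a prescreen pass of this kind could look like, in Python 3. This is not AF's actual code: the function name `prescreen`, the two patterns in `SUSPECT`, the `SAMPLE` dump, and the reading of "random index" as a shuffled list of titles are all assumptions made for illustration.

```python
import io
import random
import re
import xml.etree.ElementTree as ET

# Hypothetical patterns, loosely based on fixes listed in the outline;
# the regexes the bot really uses are not shown on this page.
SUSPECT = [
    re.compile(r"\{\{top\}\}"),   # old-style translation table header
    re.compile(r"\n{3,}"),        # runs of multiple blank lines
]

# Tiny stand-in for a dump file (the namespace version is illustrative).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">
<page><title>cat</title><revision><text>==English==
{{top}}
* French: chat</text></revision></page>
<page><title>dog</title><revision><text>==English==
{{trans-top}}</text></revision></page>
<page><title>mouse</title><revision><text>==English==



===Noun===</text></revision></page>
</mediawiki>"""


def local(tag):
    """Strip the XML namespace prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def prescreen(dump, patterns=SUSPECT, seed=None):
    """Return the titles of pages matching any pattern, in random order."""
    hits = []
    title = None
    # Stream the dump so the whole file never has to sit in memory.
    for _event, elem in ET.iterparse(dump, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
            if any(p.search(text) for p in patterns):
                hits.append(title)
        elif name == "page":
            elem.clear()  # drop the finished page's contents
    random.Random(seed).shuffle(hits)
    return hits
```

Calling `prescreen(io.BytesIO(SAMPLE))` returns the titles `cat` and `mouse` in some random order; `dog` is skipped because neither pattern matches it. A file name can be passed in place of the `BytesIO` object.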


 * rcpages

Generator called by the main routine. Calls prescreen, then cycles through reading Recent Changes, looking at the request category, and yielding pages found.
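The shape of such a generator might be sketched as follows (Python 3). The three callables passed in are hypothetical stand-ins for the prescreen pass, the Recent Changes reader, and the request-category reader; the `cycles` and `pause` parameters, and the choice to hand out one prescreened entry per cycle, are assumptions and not taken from AF.

```python
import itertools
import time


def rcpages(prescreen, recent_changes, requests, cycles=None, pause=0.0):
    """Yield page titles to examine, cycling until `cycles` runs out.

    prescreen, recent_changes and requests are caller-supplied callables
    (hypothetical here) that each return an iterable of titles.
    """
    backlog = iter(prescreen())          # computed once, up front
    rounds = itertools.count() if cycles is None else range(cycles)
    for _ in rounds:
        yield from recent_changes()      # pages edited recently
        yield from requests()            # pages in the request category
        # Assumption: work through the prescreened backlog slowly,
        # one entry per cycle, so fresh edits are seen first.
        yield from itertools.islice(backlog, 1)
        time.sleep(pause)                # be polite to the server
```

With `cycles=None` the generator never finishes, which suits a main routine that simply loops over it for as long as the bot runs.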


 * main

Reads the configuration pages and builds the tables to be used. Loops on the rcpages generator; for each entry:


 * runs regex on the entire text
 * breaks entry into language sections, plus prolog (above first section), and iwikis
 * in each language section:
   * looks for and fixes Etymology headers
   * herds cats
   * fixes bad headers
   * fixes linking in trans tables
   * fixes top to trans-top
   * subst's (replaces) language code template
   * etc
 * then reassembles the entry, removing multiple blank lines, adding rules, and so on
 * checks the actions performed
 * if any action resulted, rewrites the page
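The split-and-reassemble steps in the list above can be sketched roughly as follows (Python 3). The names `split_entry` and `reassemble`, both regexes, and the reading of "adding rules" as a `----` line between language sections are assumptions for illustration; AF's real parsing is more involved.

```python
import re

# Level-2 header such as ==English==, but not ===Noun===.
LANG_HEADER = re.compile(r"==\s*([^=\s][^=]*?)\s*==\s*$")
# Heuristic for interwiki links such as [[fr:cat]].
IWIKI = re.compile(r"\[\[[a-z][a-z-]+:[^\]]+\]\]\s*$")


def split_entry(text):
    """Split wikitext into (prolog, language sections, iwikis)."""
    lines = text.rstrip("\n").split("\n")
    # Peel interwiki links, and blank lines among them, off the end.
    iwikis = []
    while lines and (IWIKI.match(lines[-1]) or not lines[-1].strip()):
        line = lines.pop()
        if line.strip():
            iwikis.insert(0, line)
    prolog, sections = [], []
    for line in lines:
        if line.strip() == "----":
            continue                      # rules are re-added later
        m = LANG_HEADER.match(line)
        if m:
            sections.append((m.group(1), [line]))
        elif sections:
            sections[-1][1].append(line)
        else:
            prolog.append(line)           # text above the first section
    return prolog, sections, iwikis


def reassemble(prolog, sections, iwikis):
    """Join the parts, adding rules and collapsing blank-line runs."""
    chunks = []
    head = "\n".join(prolog).strip()
    if head:
        chunks.append(head)
    bodies = ["\n".join(body).strip() for _lang, body in sections]
    if bodies:
        chunks.append("\n\n----\n\n".join(bodies))
    if iwikis:
        chunks.append("\n".join(iwikis))
    text = "\n\n".join(chunks)
    return re.sub(r"\n{3,}", "\n\n", text) + "\n"
```

A caller would compare the reassembled text with the original and save the page only if they differ, which corresponds to the last two steps of the list. In this sketch, running an already-normalized entry through both functions returns it unchanged, so clean pages are never rewritten.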