User:Robert Ullmann/spork

program notes
spork is a program that looks for words used in the English Wikipedia, and adds citation entries to the wikt

Process:

spork maintains a persistent (on-disk) cache of words in the wikt that have English sections. It then looks at random 'pedia entries and checks every lower-case word longer than 3 letters. If a word is not in the wikt, it offers to create a Citations: namespace entry, which then shows up in Category:New words from Wikipedia. If the 'pedia has spelled the word incorrectly, a replacement can be supplied, and the program will then set up an edit on the article for approval.
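The word scan might be sketched like this (the helper name and tokenization are illustrative, not spork's actual code):

```python
import re

def candidate_words(article_text):
    """Yield lower-case words longer than 3 letters from 'pedia article text.

    Hypothetical helper: spork's real tokenization may differ.
    """
    for word in re.findall(r"[a-z]+", article_text):
        if len(word) > 3:
            yield word

# Words the scan would then look up in the wikt cache:
found = sorted(set(candidate_words("The spork glimmered quixotically")))
```

Anything surviving this filter is then checked against the cache as described below.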

The cache is updated in several steps:

 * 1) if the word is present in the disk cache, the word exists
 * 2) else spork reads some of an XML dump (if available), and then checks whether the word has now been found
 * 3) else it reads a few entries from a likely category, to see if it finds the word
 * 4) else it looks up the specific entry in the wikt, checking for an English section
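The four steps above can be sketched as one lookup function; all the callable names here are illustrative stand-ins, not spork's actual code:

```python
def word_exists(word, cache, read_dump=None, read_category=None, fetch_entry=None):
    """Sketch of the four-step lookup. `cache` is the in-memory view of the
    on-disk cache (a set of words known to have English sections)."""
    # 1) present in the cache: done
    if word in cache:
        return True
    # 2) scan some more of the XML dump (if one is available) into the cache
    if read_dump is not None:
        cache.update(read_dump())
        if word in cache:
            return True
    # 3) read a few entries from a likely category into the cache
    if read_category is not None:
        cache.update(read_category(word))
        if word in cache:
            return True
    # 4) last resort: fetch the specific entry and check for an English section
    if fetch_entry is not None and fetch_entry(word):
        cache.add(word)
        return True
    return False
```

Note that steps 2 and 3 feed the cache as a side effect, so later lookups get cheaper.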

While this may seem convoluted, it serves several purposes. The XML dump is optional, and it does not matter greatly if it is stale; even one a year old will help, without affecting the validity of results, since the entry itself is always checked last. So one need not worry about retrieving the XML frequently; getting one when setting up will be fine. It may be the daily, the WMF "pages-articles" or the WMF "pages-meta-current" dump; the "history" dump will not work.
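Scanning the dump incrementally could look roughly like this; the element names follow the MediaWiki export schema, but the function itself is a sketch, not spork's code:

```python
import xml.etree.ElementTree as ET

def english_titles(stream):
    """Yield titles of dump pages whose wikitext contains an English section.

    `stream` is an open (decompressed) pages-articles or pages-meta-current
    dump; the XML namespace URI varies between dump versions, so strip it.
    """
    for _, elem in ET.iterparse(stream):
        if elem.tag.rsplit("}", 1)[-1] != "page":
            continue
        title, has_english = None, False
        for child in elem.iter():
            tag = child.tag.rsplit("}", 1)[-1]
            if tag == "title":
                title = child.text
            elif tag == "text" and child.text and "==English==" in child.text:
                has_english = True
        if title and has_english:
            yield title
        elem.clear()  # keep memory bounded while streaming a large dump
```

Streaming with `iterparse` and clearing each page keeps memory flat even on multi-gigabyte dumps.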

If the word is not found there, spork will read a likely category, adding a few entries (20) to the cache; as these are English POS categories, it need not read each entry to check for the English section. This updates the cache more quickly than looking at each entry individually.
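Fetching those category members is one standard MediaWiki API call (`action=query`, `list=categorymembers`); a sketch of the request parameters, with an example category name:

```python
def category_members_query(category, limit=20):
    """Request parameters for one MediaWiki API call listing category members.

    spork reads about 20 entries per category pass; the category name passed
    in here is just an example.
    """
    return {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": "Category:" + category,
        "cmlimit": str(limit),
        "format": "json",
    }

params = category_members_query("English nouns")
```

Every title returned goes straight into the cache, since membership in an English POS category already implies an English section.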

Finally, as it will resort to reading the entry itself, the check is completely up to date, covering even uncategorized entries, before it asks about creating a citation.
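This last, authoritative check only needs to spot a level-2 English heading in the fetched wikitext; a minimal sketch:

```python
import re

def has_english_section(wikitext):
    """True if the wikitext contains a level-2 ==English== heading (sketch)."""
    return re.search(r"^==\s*English\s*==\s*$", wikitext, re.MULTILINE) is not None
```

A plain substring test would also match `===English===` or an occurrence inside a template, so anchoring the heading with `^`/`$` is slightly safer.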