User:Interwicket/code/mbwa


#!/usr/bin/python
# -*- coding: utf-8  -*-
# wikipath en wiktionary User:Interwicket/code/mbwa

""" This bot updates iwiki links between wiktionaries

It runs as a set of threads, each doing part of the process. The first group of threads generates tasks (instances of class Task) from the wiki indexes, from deletions found against those indexes, from recent changes, and from null tasks. These are picked up by the main thread and distributed to 7 (identical) "hunter" threads, which read the entries for the title. The pages are then passed to a group of 4 (again identical) "replink" threads, which write the changes (in module reciprocal.py).

Mbwa maintains a local database of all of the titles in all of the 170 wikts, so it knows which entries will need links. The database is built and updated automatically; it has no trouble starting from scratch. It is divided into 26 files, a to z, keyed on titles; titles outside a-z are reduced modulo 26 on the first letter to one of a-z. This is done to avoid having to rebuild the entire index when bsddb (eventually) corrupts one file; that file can simply be deleted and the process restarted.

Add tasks: this thread reads the title index and language links from each wikt with the MW API. It queues a task for each apparent discrepancy between an entry and the index. In the case where the title is "new", it adds the links to the index and does not queue a task; this is essential to startup (or to re-building one index file). If links are needed, they will be found while reading the index for another wikt (or the same one on the next pass). This thread also generates sets of titles queued to the next thread (delete). The indexes are read with delays adjusted in between until the interval reaches once per week per wikt.

Deletes: this compares each set of titles from add tasks to the indexes in the "inverse" direction, looking for titles and links in the index not found in the live wikt. In each case it generates a delete task, added to the main queue. This, in combination with the tasks queued to add links, ensures that the links will be brought up to date regardless of missed RC, page moves, reversions, and any other odd events.

Recent changes: reads RC (from the MW API) from each wikt, for new entries more than one hour old and less than two days old. It treats bot entries (including those not flagged as "bot", but where the username ends with "...bot") as a lower priority when queuing them. This helps give RC entries created by humans priority. The changes are read at adapted intervals: wikts that show changes are read increasingly frequently, those that do not are read less often, up to a maximum interval of about one day. The overall rate of API requests is thus kept to a minimum, while still being responsive in finding changes.

Null: this thread generates a null task at intervals, to keep the main thread spinning.

Main thread: initialization, start other threads; then for each task: read the "primary" entry referred to by the task object (in the case of a delete, see if another can be found), check the links, and if they are not complete, set up the task to be passed to a hunter thread.

Hunter: given the set of links from the page and the index, and others found along the way, read each page to be (potentially) updated for this title. Then queue all the pages with complete sets of links to the replink threads. All threads use a shared tick-tock timer to limit the page read rate.

Replink: (in reciprocal.py) replace the links in the page and write if needed; this uses another tick-tock to limit the page update rate.

Name:

"MBWA" can be given two meanings; the simplest is the acronym for Management By Wandering Around, that is, the mbwa program just looks for things worth doing, not trying to do one particular list in a particular order. However, it will -- eventually -- get to everything.

The other explanation is more complicated: in the 1970s I worked on a VLSI design system that produced NMC tapes for a photoplotter, which in turn produced the chrome-and-glass reticles used to expose the pattern on silicon wafers.

The pattern tapes had the flashes (each a precisely positioned rectangle) sorted into the optimum order for the photoplotter, to minimize the amount of motion between flashes. (As optimized runs would still take 36 or more hours, this was very important!) The optimum order for typical arrays such as memory cells was boustrophedonic -- up one row and down the next -- as the ox ploughs.

In jest, we referred to an unsorted order as "urocanic" -- as the dog pisses.

Since this program appears random in behaviour, and to a great extent is, that term would seem to apply, although here the randomness doesn't reduce performance.

So mbwa: Kiswahili for "dog".

RLU 10.2.9 """
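# Overview of the data flow described above (illustrative summary added for clarity,
# not part of the original source; thread and queue names are those used below):
#
#   addtasks (index scan) --\
#   deltasks (delete scan) --+--> tasq --> main thread --> huntq --> hunter x7
#   recent   (RC reader)   --+                                         |
#   nulltask (keep-alive)  --/                                         v
#                                          toreplink (reciprocal.py) --> replink x4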

import wikipedia
import xmlreader
import sys
import socket
import re
from time import *
from random import random, expovariate, choice
from mwapi import getwikitext, getedit, readapi, getticktock
from reciprocal import addrci, flws, getflstatus, replink, plock, updstatus, toreplink, setreptick
from iwlinks import getiwlinks

from config import usernames  # borrow global

def srep(s): return repr(u''+s)[2:-1]

def reval(s):
    # tricky as repr uses either ' or ", but uses ' if both are present, escaping it
    if "'" in s and '"' not in s:
        return eval('u"' + s + '"')
    else:
        return eval("u'" + s + "'")

def safe(s): return srep(s)
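# Illustrative example (added, not in the original): srep() flattens a unicode title to a
# plain ASCII repr suitable as a shelve/bsddb key, and reval() reverses it, e.g.
#   srep(u'caf\xe9')   ->  'caf\\xe9'   (a plain str containing a literal backslash)
#   reval('caf\\xe9')  ->  u'caf\xe9'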

reblank = re.compile(r'\[\[[A-Za-z-]+\s*:[^\]]*\]\]')

def isblank(t, p):
    # like isEmpty in wikipedia, but much better and faster, reduces to (almost) identical
    if len(t) > 20 and '[[' not in t[:20]: return False
    # which is the 99% case, except for images atop, (and pl.wikt ;-)
    if len(reblank.sub('', t).strip('\n ')) > 4: return False
    return p.isEmpty()  # resort to exact framework test, so as not to war with others

respace = re.compile(u'[ _\u00A0\u1680\u180E\u2000-\u200A\u2028\u2029'
                     u'\u202F\u205F\u3000]+')

def fixtitle(t):
    # fix a page title the way the server does to make DB keys, as pybot framework is out of date
    # code from server:
    # $dbkey = preg_replace( '/\xE2\x80[\x8E\x8F\xAA-\xAE]/S', '', $dbkey );
    #    (is BIDI, not done yet)
    # $dbkey = preg_replace( '/[ _\xA0\x{1680}\x{180E}\x{2000}-\x{200A}\x{2028}\x{2029}
    #    \x{202F}\x{205F}\x{3000}]+/u', '_', $dbkey );

    t = respace.sub(' ', t).strip(' ')
    return t
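# Illustrative example (added, not in the original): fixtitle() collapses underscores and
# the exotic space characters the way the server does when forming DB keys, e.g.
#   fixtitle(u'crash_blossom')     ->  u'crash blossom'
#   fixtitle(u' double  spaced ')  ->  u'double spaced'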

import shelve

Lcode = { }
Exists = set()
site = { }
naps = { }

def now(): return int(clock())

Quit = False

import threading, Queue

tasq = Queue.PriorityQueue()
# 35K max to keep process memory < ~100MB
# 70K max to handle state of affairs (17.3.9), limit is soft in code below

huntq = Queue.Queue()
# hunter queue, just big enough to distribute tasks to 7 or so hunters, soft limit

delq = Queue.Queue()
# delete task queue, project sets for deltasks to check


# a single index file is 650MB (20.2.9) and if it gets corrupted, we start over
# so we use 26 of them, -a through -z, each title being in the file that is = mod 26 to the first char
# (note this will be different for some chars in a wide build, we just use the first "UTF-16" word)
# a corrupted file can just be deleted and will be re-built

mifs = range(0, 26) # list, to be shelves

def mif(s): return mifs[ord(s[0])%26]
def lkey(s): return 'hijklmnopqrstuvwxyzabcdefg'[ord(s[0])%26]
# modulus ops on titles; note on a "narrow" build, this only uses the top half of surrogate pairs
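# Illustrative examples (added, not in the original) of the mod-26 mapping: ord('a') % 26
# is 19, and index 19 of the rotated alphabet above is 'a', so plain lowercase titles keep
# their own first letter as key letter:
#   mif(u'dog')       -> mifs[ord('d') % 26] == mifs[22],   lkey(u'dog')      == 'd'
#   mif(u'Zebra')     -> mifs[ord('Z') % 26] == mifs[12],   lkey(u'Zebra')    == 't'
#   mif(u'\xe9tude')  -> mifs[0xe9 % 26]     == mifs[25],   lkey(u'\xe9tude') == 'g'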

milock = threading.RLock()
# Mi = shelve.open("mbwa-index")
# (could use 26 locks, but this isn't going to matter that much?)


# mbwa index is keyed with srep(title), value is tuple of time, links, redirs

def menc(ul, ur): return "%s,%s" % (" ".join(ul), " ".join(ur))

def mdec(s):
    sp = s.split(',')
    return sp[0].split(), sp[1].split()
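# Illustrative example (added, not in the original) of the stored value format: the link
# and redirect code lists are space-separated, joined by a comma, e.g. for a title present
# on de, en and fr with a redirect on pl:
#   menc(['de', 'en', 'fr'], ['pl'])  ->  'de en fr,pl'
#   mdec('de en fr,pl')               ->  (['de', 'en', 'fr'], ['pl'])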


# rather boring sequence of set routines:

def miget(title):
    # get the links and redirs list as far as we know
    with milock:

        Mi = mif(title)

        if srep(title) in Mi:
            ul, ur = mdec(Mi[srep(title)])
            return ul, ur

        return [], []

def miadd(code, title, others = []):
    # add a link from the wikt indexes, as well as others when first found
    with milock:

        Mi = mif(title)

        if srep(title) in Mi:
            ul, ur = mdec(Mi[srep(title)])
        else:
            ul = []
            ur = []

        if code in ul: return
        if code in ur: ur.remove(code)

        if not ul:
            us = set(others)
            us.add(code)
            us -= set(ur)
            ul = sorted(us)
        else:
            ul.append(code)

        Mi[srep(title)] = menc(ul, ur)

def midel(code, title):
    # delete a link when we find no page
    with milock:

        Mi = mif(title)

        if srep(title) in Mi:
            ul, ur = mdec(Mi[srep(title)])
        else:
            return

        if code in ul: ul.remove(code)
        if code in ur: ur.remove(code)

        if not ul and not ur:
            del Mi[srep(title)]
        else:
            Mi[srep(title)] = menc(ul, ur)

def mired(code, title):
    # found a redirect, move to redirs list
    with milock:

        Mi = mif(title)

        if srep(title) in Mi:
            ul, ur = mdec(Mi[srep(title)])
        else:
            ul = []
            ur = []

        if code in ur and code not in ul: return

        if code in ul: ul.remove(code)

        if code not in ur: ur.append(code)

        Mi[srep(title)] = menc(ul, ur)

def miset(title, ul, ur):
    # at completing an entry (title), record links and redirs
    with milock:

        Mi = mif(title)

        Mi[srep(title)] = menc(ul, ur)
        Mi.sync()

def miall(tix, nap = 1.0):
    # get the index keys for one of the files (title is a-z)
    # takes a non-trivial amount of time, but less than a minute
    # then return title, links, redirs for each
    # caller must take care not to block as we have lock held
    # 200 on each lock, as locking on each entry is ridiculous amounts of CPU!

    Mi = mif(tix)
    with milock: mk = Mi.keys()

    klim = len(mk)
    k = 0
    while k < klim:
        i = 0
        with milock:
            while k+i < klim and i < 200:
                if mk[k+i] in Mi:  # (may have gone away ;-)
                    ul, ur = mdec(Mi[mk[k+i]])
                    yield reval(mk[k+i]), ul, ur
                i += 1
        k += 200
        sleep(nap)  # while not holding lock

    # done


# read page titles and links from wikts, return apparent mismatches

from getlinks import getlinks

def livelinks(home):

    redirs = flws[home].redirs

    # sets of titles present, for delete scan
    pset = { }
    for k in 'abcdefghijklmnopqrstuvwxyz': pset[k] = set()

    # read page title and links from the wikt, compare to our index

    for title, links, redflag, bad in getlinks(flws[home].site, plock = plock):

        if Quit: break

        pset[lkey(title)].add(title)

        # if a redirect, add as such, continue
        if redflag:
            mired(home, title)
            continue

        ul, ur = miget(title)

        # ll is set, validate ...
        ll = set()
        for lc in links:
            lc = str(lc)  # not unicode (at Py2.6)
            if lc not in flws:
                # odd case, WMF server thinks it is a language (user copied links from 'pedias?)
                # remove these: (will set lockedwikt in flw.__init__)
                flws[lc].deletecode = True
            if flws[lc].lockedwikt and not flws[lc].deletecode: continue
            # (we want to edit, to delete, in the second case)
            ll.add(lc)

        # if there are no links at all (not even home!) then this is a new title (to us)
        # (there may have been redirects) don't do it on this pass (if it is itself okay, not bad)
        if not ul and not bad:
            miadd(home, title, sorted(ll))
            continue

        # make sure this wikt is present for title (it is not a redirect)
        if home not in ul:
            miadd(home, title)
            ul, ur = miget(title)

        if redirs: ul += ur

        # compare links to ul, should match
        # first add home to ll, then it should be identical
        ll.add(home)

        # if not redirs, but some present, is okay (at this point):
        if not redirs and ur:
            for lc in ur: ll.discard(lc)
            # (also no point in trying to read them in hunt ;-)

        # similar but different case for nolink, e.g. pl->ru
        for lc in flws[home].nolink:
            if lc in ul: ll.add(lc)  # pretend present for comparison

        # if apparent mismatch, or bad link(s) in the entry
        if sorted(ll) != sorted(ul) or bad:

            lcs = set(ul)
            lcs.discard(home)

            lnotu = [x for x in ll if x not in ul]
            unotl = [x for x in ul if x not in ll]

            # with plock: print "   in LL, not in UNION:", lnotu
            # with plock: print "   in UNION, not in LL:", unotl

            # some difference, so nk always > 1
            yield title, lcs, ur, len(lnotu) + len(unotl) + 1, bad

        # else: with plock: print "(%s matches)" % repr(title)

    for k in 'abcdefghijklmnopqrstuvwxyz':
        delq.put( (home, k, pset[k]) )  # start scans for deletions in this wikt

def sign(x):
    if x < 0: return -1
    if x > 0: return 1
    return 0
# extraordinarily silly omission from Python

kloset = clock()
def klo(): return (clock() - kloset) / 1000.0


# Task class, with comparison key r controlled for time
# also slots optimization, (yes, tacky, but we gen up a lot of these)

class Task:
    __slots__ = [ 'home', 'title', 'r', 'nk', 'src', 'force', 'page', 'onq' ]

    def __init__(self, home='', title=u'', r=0, nk=0, src='', force=False, onq = 0):
        self.home = home
        self.title = title
        self.r = r or expovariate(.01)
        self.nk = nk
        self.src = src
        self.force = force
        self.onq = onq or time()
        self.page = None  # added in main thread

        # now set r to move forward, run queue, not bury things forever
        while self.r > 4200.0: self.r -= 700.0
        self.r += klo()

    def __cmp__(self, other): return sign(self.r - other.r)
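# Illustrative note (added, not in the original): tasq is a PriorityQueue, so get() returns
# the task with the smallest r first. r defaults to expovariate(.01), mean ~100; recent-change
# tasks use that default, bot-created RC tasks get expovariate(.0014), mean ~714, and index
# tasks expovariate(.001), mean ~1000, so human RC edits tend to sort first while the order
# within each class stays randomized. Adding klo() (elapsed run time) keeps old tasks from
# being buried forever behind newly queued ones.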


# return tasks in random but prioritized order

qpw = { }  # queue per wikt, to implement max and to-be-done
qpwlock = threading.Lock()

lru = shelve.open("mbwa-lru-list")
# keep track of least recently used (done) wikts, persists on disk


# last seen, a "timeout set", a set that elements magically disappear from after a time

from weakref import WeakValueDictionary
from heapq import heappush, heappop

class tmo(float): pass

class lastseen:

    def __init__(s, timeout):
        s.timeout = timeout
        s.wdict = WeakValueDictionary()
        s.theap = []

    def add(s, key):
        t = tmo(clock())
        s.wdict[key] = t
        heappush(s.theap, t)

    def __contains__(s, key):
        while s.theap and s.theap[0] < clock() - s.timeout: heappop(s.theap)
        return key in s.wdict
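# Illustrative usage (added, not in the original): recent() below keys this on "lc:title":
#   seen = lastseen(4 * 86400)   # entries expire after about 4 days
#   seen.add('fr:chien')
#   'fr:chien' in seen           # True until the timeout passes
# Expiry works because once an expired tmo value is popped from the heap, the
# WeakValueDictionary drops its entry, so the key simply stops being "in" the set.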

def addtasks():

    np = 0.0

    # entries seen already, use Weak-Val-Dict directly, on titles->tasks
    # title will be "in" seen if still on queue or hunt
    seen = WeakValueDictionary()

    # init lru:
    for lc in Exists:
        lc = str(lc)  # not unicode
        if flws[lc].lockedwikt: continue
        if lc not in lru: lru[lc] = 0.0
    # cleanup:
    for lc in lru.keys():
        if flws[lc].lockedwikt: del lru[lc]
    lrulen = len(lru)

    # one off 22.5.10
    # lru['ml'] = 0

    # for each, read everything, set priorities

    mint = 20

    while not Quit:

        np += 1.0/lrulen   # "pass" number

        # find least recently done wikt:

        home = sorted(lru.keys())[0]  # or any other given code, not in unicode, but lru key
        old = lru[home]
        for lc in lru:
            if lru[lc] < old:
                old = lru[lc]
                home = lc

        # if less than a week has passed, sleep for a while, and then go ahead
        # 70 min * 150 wikts is about a week (168 hours), is about the target
        if lru[home] + (168 * 3600) > time(): mint = min(mint+2, 90)
        else: mint = max(mint-5, 20)

        for i in range(mint, 0, -1):
            with plock: print "(add tasks: sleeping, %s next in %d)" % (home, i)
            sleep(60)
            if Quit: break
        if Quit: break

        # record it now, so if we fail (or are aborted) we go on to the next one
        lt = strftime("%d %B %Y %H:%M:%S", gmtime(lru[home])).lstrip('0')
        if not lru[home]: lt = '[never]'
        lru[home] = time()
        lru.sync()

        # queue all within process memory reason
        qmax = (70000 - tasq.qsize()) / 2
        with qpwlock:
            if home not in qpw: qpw[home] = 0
            qmax += qpw[home]

        with plock: print "(reading links from %s.wikt, last done %s, qmax %d)" % (home, lt, qmax)

        # skip codes we don't want
        # we only want "home" wikts with bot or noflag status

        flw = flws[home]
        ponly = False
        status = getflstatus(flw, nowrite = True)
        if status not in ["bot", "noflag", "globalbot", "test", "blocked"]:
            # we don't want this one at all
            with plock: print "(%s status is %s, not reading at all)" % (home, flw.status)
            continue
        if status not in ["bot", "noflag", "globalbot"]:
            with plock: print "(%s status is %s, not queueing from links)" % (home, flw.status)
            ponly = True

        tf = 0
        qt = 0

        for title, lcs, urs, nk, bad in livelinks(home):

            if Quit: break

            if ponly:
                # not bot or noflag, we are just counting them
                tf += 1
                continue

            # if it is known as a redirect, skip it (odd case)
            if home in urs: continue

            # clip main page here if we can
            if title.lower() == 'main page': continue
            if title == flw.mainpage: continue

            if title in seen: continue  # queued already on this run, counted in qpw

            tf += 1  # found a (new) task

            if qpw[home] > qmax: continue  # doesn't make the cut

            with qpwlock: qpw[home] += 1

            r = expovariate(.001)
            t = Task(home=home, title=title, r=r, nk=nk, src='idx', force=bad)
            seen[title] = t

            # if lots on queue, no hurry ...
            if tasq.qsize() > 500: sleep(1)

            tasq.put(t)
            qt += 1

        if Quit: break
        with plock: print "(found %d tasks for %s, queued %d, queue size %d)" % (tf, home, qt, tasq.qsize())
        if flw.status in ["bot", "noflag", "globalbot", "test"]:
            flw.tbd = tf - qt + qpw[home]  # total less queued on this pass plus on-queue (!)
            updstatus(flw)

        sleep(70)  # rest between indexes. Is a huge effort, reading them ... (:-)
        # (real reason is to keep process from spinning when little to do, main timer above)

    # end of true-loop
    with plock: print "(add tasks thread ending)"


# now recent changes set

def recent(home = 'en'):

    # entries seen already, timeout set, keep for > 3 days (use 4)
    seen = lastseen(4 * 86400)

    # set up list of wikt codes to look at

    qtime = { }
    rcstart = { }
    for lc in Exists:
        if flws[lc].lockedwikt: continue
        site[lc] = flws[lc].site
        naps[lc] = 60 * choice(range(3, 91))  # scatter 3 to 90 minutes
        if lc == home: naps[lc] = 0           # no wait for home wikt the first time
        qtime[lc] = now() + naps[lc]          # initial time
        rcstart[lc] = ''

    ny = 0

    rcex = re.compile(r'<rc [^>]*title="(.+?)"[^>]*>')
    rccont = re.compile(r'rcstart="(.+?)"')
    rcisbot = re.compile(r'<rc [^>]*user="[^>]*bot"[^>]*>', re.I)

    while not Quit:

        # sleep until next one
        nextq = now() + 1000000
        nextlc = ''
        for lc in qtime:
            if qtime[lc] < nextq:
                nextq = qtime[lc]
                nextlc = lc
        st = nextq - now()
        # if st > 90:
        #     with plock: print "(%d, sleeping %d minutes, %s next)" % (now(), (st+29)/60, nextlc)
        if st > 0: sleep(st)
        if st < -120:
            with plock: print "(rc %d minutes behind)" % (-(st-29)/60)
        lc = nextlc

        if Quit: break

        # for mbwa, only read rc from non-troublesome wikts
        # this saves the bother of looking at closed wikts too
        flw = flws[lc]
        if getflstatus(flw, nowrite = True) not in ["bot", "noflag", "globalbot", "test"]:
            with plock: print "(%s status is %s, not reading rc)" % (lc, flw.status)
            qtime[lc] = now() + 86400  # look again tomorrow
            continue

        # read recentchanges, new entries, namespace 0, from site:

        if True:  # [indent]

            # with plock: print "(%d, reading from %s.wikt)" % (now(), lc)
            nf = 0

            # set parameters

            # from a little while ago (8 hours)
            if not rcstart[lc]:
                rcstart[lc] = '&rcstart=' + strftime('%Y-%m-%dT%H:%M:%SZ', gmtime(time() - 8*3600))

            # up to one hour ago
            rcend = '&rcend=' + strftime('%Y-%m-%dT%H:%M:%SZ', gmtime(time() - 3600))

            # slow start, don't need to pick up too quickly
            rclimit = "&rclimit=%d" % min(10 + ny/20, 200)

            # with plock: print "(options " + rcend + rcshow + rclimit + ")"

            try:
                rct = readapi(flw.site,
                    "action=query&list=recentchanges&format=xml&rcprop=title|user|flags&rcdir=newer" +
                    "&rctype=new&rcnamespace=0" + rcend + rcstart[lc] + rclimit,
                    plock = plock)
            except wikipedia.NoPage:
                with plock: print "can't get recentchanges from %s.wikt" % lc
                # rct = ''
                # sleep(30)
                qtime[lc] = now() + 700  # do other things for a bit
                continue
            except KeyError:
                # local bogosity
                with plock: print "keyerror"
                sleep(20)
                continue

            # (They've borked the API by making gratuitous changes, we can't check
            #  to see if we have an empty "recentchanges" element, because it isn't
            #  always present now! Look for an attribute too. Sigh.)
            if "<recentchanges />" in rct or 'recentchanges=""' in rct: pass
            elif '</recentchanges>' not in rct:
                with plock: print "some bad return from recentchanges, end tag not found"
                with plock: print safe(rct)
                # rct = ''
                sleep(30)
                qtime[lc] = now() + 300  # do other things for a bit
                continue

            # continue parameter:

            mo = rccont.search(rct)
            if mo:
                rcstart[lc] = "&rcstart=" + mo.group(1)
            else:
                # we are up to date, set to one hour + 100 sec ago
                rcstart[lc] = '&rcstart=' + strftime('%Y-%m-%dT%H:%M:%SZ', gmtime(time() - 3700))

            found = False
            for mo in rcex.finditer(rct):

                if Quit: break

                title = mo.group(1)

                # unescape, API uses (e.g.) &#039; for single '
                title = wikipedia.html2unicode(title)
                title = title.replace('_', ' ')
                if ':' in title: continue
                if not title: continue

                isbot = False
                if 'bot=""' in mo.group(0): isbot = True
                if rcisbot.match(mo.group(0)): isbot = True

                if lc + ':' + title not in seen:

                    lcs, urs = miget(title)  # new to us or not?
                    if lc in lcs:
                        lcs.remove(lc)  # can happen on restarts (or entry re-created)
                        isnew = False
                    else:
                        isnew = True    # new entry created with iwikis, do it anyway
                    nk = len(lcs) + len(urs) + 1

                    if nk == 1: continue  # unique title

                    seen.add(lc + ':' + title)

                    if not isbot:
                        t = Task(home=lc, title=title, nk=nk, src='rc', force=isnew)
                    else:
                        t = Task(home=lc, title=title, nk=nk, src='bot', r = expovariate(.0014))

                    tasq.put(t)
                    ny += 1
                    nf += 1
                    found = True

            if found:
                naps[lc] /= 2
                naps[lc] = max(naps[lc], 300)  # five minutes
            else:
                mn = naps[lc]/300  # one-fifth, in minutes
                naps[lc] += 60 * choice(range(5, 10 + mn))  # add 5-10 minutes or longer if we don't find anything
                maxnap = 60 * choice(range(1400, 1500))     # around 24 hours
                naps[lc] = min(naps[lc], maxnap)
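            # Worked example (added, not in the original): a wikt sitting at naps == 3600
            # that shows a new entry drops to max(3600/2, 300) == 1800 seconds; one that
            # stays quiet grows by choice(range(5, 10 + 3600/300)) == 5..21 minutes per
            # idle poll, capped by maxnap at roughly 24 hours.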

            qtime[lc] = now() + naps[lc]
            with plock: print "(rc found %d in %s, next in %d minutes)" % (nf, lc, (naps[lc]+29)/60)

            """
            if naps[lc] > 90:
                with plock:
            elif naps[lc] > 30:
                with plock: print "(rc found %d in %s, next in %d seconds)" % (nf, lc, naps[lc])
            else:
                with plock: print "(rc found %d in %s, next immediately)" % (nf, lc)
            """

    with plock: print "(recent changes thread ending)"

def deltasks():

    # incoming queue is sets of titles, already sorted by modulus key
    # this is a bit more complicated than just a set for the whole wikt, but allows
    # us to release memory for each set as we go

    psets = { }   # dict by key letter and then lc of sets queued to us
    found = { }   # dict by lc of number found
    tbc = 0       # total titles to be checked

    while True:

        # scan our whole local db, looking for entries that list a title, when the title is not in wikt

        for tix in 'abcdefghijklmnopqrstuvwxyz':

            # look for new task sets, block as we must have one, or not enough to be worth it yet
            # pick up whatever is available now, add to our little structure

            while delq.qsize() > 0 or not psets or tbc < 100000:
                lc, k, pset = delq.get()
                if k not in psets: psets[k] = { }
                if lc in psets[k]: tbc -= len(psets[k][lc])
                psets[k][lc] = pset  # if we have somehow wrapped all the way 'round use new one!
                found[lc] = 0        # make sure it exists
                tbc += len(pset)

            # any for this letter?
            if tix not in psets: continue
            ptix = psets[tix]

            with plock: print "(starting delete scan for %s/%s, tbc %d)" % (','.join(ptix.keys()), tix, tbc)

            for lc in found: found[lc] = 0

            # read index file
            for t, ul, ur in miall(tix, nap = 1.0):  # pole, pole, hakuna matata!

                # NOTE milock held here by miall, don't take other locks or block
                #      we do take tasq sync lock implicitly, and release it
                if Quit: return  # should unwrap everything?

                for lc in ptix:
                    if t not in ptix[lc] and (lc in ul or lc in ur):
                        if t != fixtitle(t):
                            # got a bad one somewhere, delete from db now
                            midel(lc, t)
                            continue  # no need to check the fixed title?

                        """ better handled by 'exists' for 'del' below? more general case?
                        if lc == 'ml' and u'\u0d4d\u200d' in t:
                            # "bad" titles from before forced 5.1 "normalization", don't do delete
                            # op, it will add bad iwikis
                            # this prevents thrashing, but doesn't solve problem
                            continue
                        """

                        # queue up a delete task
                        task = Task(home=lc, title=t, src='del')
                        tasq.put(task)
                        found[lc] += 1
                        break  # don't look at other wikts, one task is enough

            for lc in ptix:
                if found[lc]:
                    with plock: print "(delete scan for %s/%s, %d found)" % (lc, tix, found[lc])
                tbc -= len(ptix[lc])
            ptix = None
            del psets[tix]  # done with all for letter key, discard

    # end of True-loop
    with plock: print "(delete thread ending)"

def nulltask():

    wasrt = 10.0

    # keep main task queue and thread slithy et lubriceaux

    while not Quit:
        sleep( min(tasq.qsize() + 70, 350) )
        tasq.put( Task(src='null') )

        # adjust rate (mostly this is for fun, though it is useful in spreading load ;-)
        # below 200, is 7 sec, above 2700 is 2 seconds, at 7000 1 second, at 10K no reptick

        rt = min(max((3700.0 - tasq.qsize())/500.0, 2.0), 7.0)

        # and corrections outside range: (cover the range, so no restarts, this is the serious advantage)
        if tasq.qsize() > 5000: rt = 1.5
        if tasq.qsize() > 7000: rt = 1.0
        if tasq.qsize() > 10000: rt = 0.0
        if tasq.qsize() < 10: rt = 10.0
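        # Worked examples (added, not in the original) of the rate formula above:
        #   qsize 200   -> (3700-200)/500 = 7.0   -> rt 7.0 s
        #   qsize 2700  -> (3700-2700)/500 = 2.0  -> rt 2.0 s
        #   qsize 6000  -> clamped to 2.0, then the >5000 correction -> rt 1.5 s
        #   qsize 12000 -> corrections cascade to rt 0.0 (no reptick)
        #   qsize 5     -> rt 10.0 s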

        if int(wasrt*10) != int(rt*10):
            with plock: print "(replink ticktock was %.3f, now %.3f)" % (wasrt, rt)
            setreptick(rt)
            wasrt = rt

    with plock: print "(null thread exiting)"

def main():

    socket.setdefaulttimeout(70)

    with plock: flws['en'].site.forceLogin()

    # setup basics

    for c in 'hijklmnopqrstuvwxyzabcdefg':
        mifs[ord(c)%26] = shelve.open('mbwa/mbwa-index-' + c, protocol = 2)

    enw = wikipedia.getSite(code = "en", fam = "wiktionary")

    # make sure we have an flw for everything claimed to be in family (including stops)
    for code in flws['en'].site.family.langs: foo = flws[code]

    # get active wikt list
    # minus crap. Tokipona? what are they thinking? Klingon? ;-) deleted ISO code
    # se has no wiktionary (not even closed), as is locked (but not shown locked in table?)
    Lstops = ['tokipona', 'tlh', 'sh', 'se', 'as']

    sitematrix = readapi(enw, "action=sitematrix&format=xml")

    rematrix = re.compile(r'//([a-z-]+)\.wiktionary')

    sms = set()
    for code in rematrix.findall(sitematrix):
        sms.add(code)
        # print "found code", code, len(sms)
        if code in Lstops: continue
        Exists.add(code)
        foo = flws[code]
        # see if we have a login in user config, else pretend we do
        # has to be done before any call, or login status gets confused!
        if code not in usernames['wiktionary']:
            usernames['wiktionary'][code] = "Interwicket"

    # set delete for anything not in matrix:
    for lc in flws:
        if lc not in sms: flws[lc].deletecode = True
    with plock: print "found %d active wikts" % len(Exists)
    if len(Exists) < 150: return

    for lc in Exists:
        site[lc] = wikipedia.getSite(lc, "wiktionary")
        naps[lc] = 0  # nil, might be referenced by hunt

    with plock: print "starting ..."

    # start task generation threads, then yield queue entries:

    tt = threading.Thread(target=addtasks)
    tt.daemon = True  # kill silently on exit (:-)
    tt.name = 'get link tasks'
    tt.start()

    rt = threading.Thread(target=recent)
    rt.daemon = True  # kill silently on exit (:-)
    rt.name = 'get recent changes'
    rt.start()

    dt = threading.Thread(target=deltasks)
    dt.daemon = True  # kill silently on exit (:-)
    dt.name = 'delete scan'
    dt.start()

    nt = threading.Thread(target=nulltask)
    nt.daemon = True  # kill silently on exit (:-)
    nt.name = 'null task generator'
    nt.start()

    # now "hunter tasks"

    for i in range(1, 8):
        ht = threading.Thread(target=hunter)
        ht.daemon = True
        ht.name = 'hunter %d' % i
        ht.start()

    nt = 0

    while True:
        task = tasq.get()

        if task.src == 'null':
            with plock: print '(null points r %.4f on queue %.2f seconds clock %.1f queue %d ' \
                              'hunt %d replink %d tick tock %.1f)' \
                              % (task.r, time() - task.onq, clock() - kloset, tasq.qsize(),
                                 huntq.qsize(), toreplink.qsize(), getticktock())
            continue

        # queue limit from/for addtasks (-)
        if task.src == 'idx':
            with qpwlock: qpw[task.home] -= 1

        nt += 1

        # Task:
        with plock: print nt, '('+task.src+')', task.home, srep(task.title), \
                          "links", task.nk, "random", "%.4f"%(task.r), "queue", tasq.qsize()

        # locals, and coerce types
        home = task.home
        title = task.title
        ul, ur = miget(title)
        lcs = set(ul)
        urs = set(ur)
        lcs.discard(home)
        urs.discard(home)

        mysite = wikipedia.getSite(home, 'wiktionary')
        page = wikipedia.Page(mysite, fixtitle(title))
        task.page = page
        title = task.title = page.title()

        if ':' in title: continue  # redundant, but eh?
        if title.lower() == 'main page': continue
        if not title: continue

        # with plock: print "%s:%s" % (home, srep(title))

        # structure of code here is leftover from source (-)
        tag = True
        if tag:

            # ... pick up current version

            try:
                # text = page.get()
                text = getwikitext(page, plock = plock)
                oldtext = text
                if isblank(text, page):
                    # we don't want to update other entries, but treat this as missing
                    # we will look at it again every few days, it may then have content
                    with plock: print "   ... page is effectively blank"
                    midel(home, title)
                    text = ''
            except wikipedia.NoPage:
                with plock: print "   ... %s not in %s.wikt" % (safe(page.title()), safe(home))
                midel(home, title)
                # if task.src == 'del' and lcs:
                if lcs:  # hmmm...
                    # others?
                    home = lcs.pop()
                    if flws[home].status in ['bot', 'globalbot', 'test', 'noflag']:
                        # requeue to ourselves: (can happen more than once)
                        task.home = home
                        task.src = 'delrq'  # hmmm...
                        tasq.put(task)
                text = ''
            except wikipedia.IsRedirectPage:
                with plock: print "   ... redirect page"
                mired(home, title)
                text = ''
            except KeyError:
                # annoying local error, from crappy framework code
                with plock: print "KeyError"
                sleep(20)
                continue
            except Exception, e:
                with plock: print "unknown exception from getwikitext", repr(e)
                sleep(30)
                continue

            if not text: continue

            # if case was delete, and exists, we are done
            # this covers the Malayalam (ml) Unicode 5.1 force case, page appears to exist
            if task.src == 'del':
                with plock: print "   ...", srep(title), "exists now"
                continue

            act = ''

            # use our newer code, not framework
            ls = getiwlinks(text, flws).keys()

            # special case for pl here ...
            for lc in flws[home].nolink:
                if lc not in ls: lcs.discard(lc)

            # wikt links to redirs
            if flws[home].redirs: lcs |= urs

            # list of iwikis in entry should match lcs, if not, we need to update
            if sorted(ls) == sorted(lcs) and not task.force:
                with plock: print "   ...", srep(title), "is okay"
                miadd(home, title)  # ensure present in rc case (added with iwikis?)
                continue

            # if not always adding redirs to this wikt, but some present, is ok
            # also nolink wikts
            if (not flws[home].redirs or flws[home].nolink) and not task.force:
                ok = True
                # need to remove something
                for s in ls:
                    if s not in lcs and s not in urs and s not in flws[home].nolink: ok = False
                # need to add something
                for s in lcs:
                    if s not in ls: ok = False
                if ok:
                    with plock: print "   ...", srep(title), "is okay (may have redirects or nolinks)"
                    miadd(home, title)
                    continue

            # go hunt down some iwikis, add reciprocals when needed

            with plock: print "   ... hunting iwikis for", srep(title)
            sleep(huntq.qsize()*5)  # q limit to reasonable?
            huntq.put(task)

    # loop on task ends

    # done

def hunter():

    while not Quit:
        task = huntq.get()

        # locals, and coerce types
        home = task.home
        title = task.title
        page = task.page

        links, redirs, complete = hunt(page)
        if Quit: break  # return from hunt will not be valid

        # and update this page:
        addrci(page, flws[home].site, links = links, redirs = redirs, remove = complete)

        # record this title as done, links and redirs known
        if complete:
            ul = set(links.keys())
            ul.add(home)
            ur = set(redirs.keys())
            # sorted is nice, and makes lists again
            miset(title, sorted(ul - ur), sorted(ur))
        # else it will get done again at some point, hopefully without exceptions

    with plock: print "(hunter thread ending)"


# wiki-hunt ... see if a word is in other wikts, return list ...

def hunt(page):

    word = page.title()
    text = getwikitext(page, plock = plock)  # will just return _contents
    home = page.site().lang

    ul, ur = miget(word)
    totry = set(ul) | set(ur)

    done = set()
    fps = set()
    links = { }
    redirs = { }

    # reiw = re.compile(r'\[\[([a-z-]{2,11}):' + re.escape(word) + r'\]\]')

    # simple scan for existing iwikis, use improved code

    # for lc in reiw.findall(text):
    iws = getiwlinks(text, flws)
    for lc in iws:
        lc = str(lc)  # not unicode
        # if lc in site:
        totry.add(lc)

    # not home:
    totry.discard(home)
    done.add(home)

    exceptions = False

    while totry:
        lc = totry.pop()
        if flws[lc].lockedwikt or flws[lc].deletecode: continue

        if Quit: return None, None, False

        try:
            fpage = wikipedia.Page(site[lc], word)
            text = getwikitext(fpage, plock = plock)
            if isblank(text, fpage):
                # we don't want to link to entirely blank pages
                with plock: print "      ", srep(word), "in", lc, "is blank or empty"
                done.add(lc)
                continue  # not adding to links
            with plock: print "      ", srep(word), "found in", lc
        except wikipedia.NoPage:
            with plock: print "      ", srep(word), "not in", lc
            done.add(lc)
            continue
        except wikipedia.IsRedirectPage:
            redirs[lc] = fpage
            with plock: print "      ", srep(word), "found in", lc, "(redirect)"
        except Exception, e:
            exceptions = True
            with plock: print "exception testing existence of word", str(e)
            done.add(lc)
            continue

        done.add(lc)
        links[lc] = fpage

        # add to list to add reciprocal link, or complete set, don't (can't :-) update redirects
        if lc not in redirs: fps.add(fpage)

        # look for iwikis in the page, add to to-be-tried if not already done

        iws = getiwlinks(text, flws)
        for lc in iws:
            lc = str(lc)  # not in unicode
            if lc not in done and lc not in totry:
                with plock: print "           found further iwiki", lc
                totry.add(lc)

    # all done, now add reciprocals
    # don't remove anything if there were exceptions because hunt may be incomplete
    # if no exceptions, hunt is complete for these entries (there may be others not seen,
    # but then they aren't linked, as we've looked at all links ...), so remove any
    # links not found:

    for fpage in fps:
        if Quit: return None, None, False
        addrci(fpage, site[home], links=links, redirs=redirs, remove=not exceptions)

    # return list of all links and redirects, and flag if complete
    return links, redirs, not exceptions


# end? Finally?

if __name__ == "__main__": try: main except KeyboardInterrupt: print "(keyboard interrupt)" # mostly just suppress traceback except Exception, e:       print "exception", repr(e) finally: Quit = True replink(end = True) sleep(210) # give a bit of a chance for add tasks/hunt/rc to stop cleanly for i in range(0, 26): print "closing index file for", 'hijklmnopqrstuvwxyzabcdefg'[i] sleep(1) # time for print with milock: mifs[i].close lru.close wikipedia.stopme