User:Robert Ullmann/sandbox

English Wiktionary recovery procedure for "Serbo-Croatian" entries

Objective: to restore language sections and content removed by several editors intent on merging Croatian, Bosnian, and Serbian into "Serbo-Croatian".

Process steps and notes:


 * 1) Read XML dump, identify entries with missing language sections.
 * 2) * if entry does not have a "Serbo-Croatian" language header, skip
 * 3) * if entry title is Cyrillic, check if Serbian is missing
 * 4) * if entry title is Latin, check if any of [ Bosnian, Croatian, Serbian ] missing
 * 5) Get current entry from wikt.
 * 6) Identify each missing section.
 * 7) Look for old section in entry history, restore it if found.
 * 8) * read a reasonable number of past revisions from API, find most recent section for language
 * 9) * add the section to the current entry text
 * 10) Else: generate a new language section, changing templates etc.
 * 11) * attempt to determine from the "Serbo-Croatian" language section which languages should be present
 * 12) * for each (Serbian if Cyrillic, all 3 if Latin), copy the section
 * 13) * correct template language codes (various cases, sh/hbs to the appropriate code)
 * 14) * delete references to Cyrillic from Bosnian and Croatian entries
 * 15) * apply modifications to POS template(s), as the languages are different
 * 16) * replace language name in appropriate categories
 * 17) * tag with attention template if heuristics show differences in content given (definition for only one language, etc)
 * 18) If entry modified by the above, save with comment explaining which sections were restored or created.
 * 19) * tag for AF to re-sort languages and clean up spacing
 * 20) Delete "Serbo-Croatian" section (if desired)
 * 21) * if it can be determined that the appropriate sections exist, and content diffs show all information is present, remove redundant SC section
 * 22) Read XML again (or on the same pass); for English entries, identify entries and their Translations sections
 * 23) * check if "Serbo-Croatian" in some table, else skip
 * 24) Read entry from wikt
 * 25) Find old revisions with tables
 * 26) * if found, re-insert Bosnian, Croatian, and Serbian translations lines
 * 27) Create missing translations lines
 * 28) * copy "Serbo-Croatian", drop Cyrillic forms for Croatian and Bosnian
 * 29) * add as ttbc?
 * 30) Add comment, and save
 * 31) * tag for AF to re-sort translations tables
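
The screening in steps 2 to 4 could be sketched roughly as follows. This is only an illustration, not the actual bot code: the function names (language_sections, title_script, missing_sections) are made up here, and it assumes language sections are marked by level-2 headers such as ==Serbian== and that a title's script can be judged from the Unicode names of its letters.

```python
import re
import unicodedata

# Level-2 headers such as "==Serbian==" mark language sections in wikitext.
# "===Noun===" and deeper headers are deliberately not matched.
L2_HEADER = re.compile(r"^==[ \t]*([^=].*?)[ \t]*==[ \t]*$", re.MULTILINE)

def language_sections(wikitext):
    """Return the set of level-2 (language) header names in an entry."""
    return {m.group(1) for m in L2_HEADER.finditer(wikitext)}

def title_script(title):
    """Classify a title as 'Cyrillic', 'Latin', or 'other' by its letters."""
    scripts = set()
    for ch in title:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name.startswith("CYRILLIC"):
                scripts.add("Cyrillic")
            elif name.startswith("LATIN"):
                scripts.add("Latin")
            else:
                scripts.add("other")
    # Mixed-script or letterless titles are treated as 'other' and skipped.
    return scripts.pop() if len(scripts) == 1 else "other"

def missing_sections(title, wikitext):
    """Steps 2-4: list the language sections that appear to be missing."""
    present = language_sections(wikitext)
    if "Serbo-Croatian" not in present:
        return []  # step 2: no "Serbo-Croatian" header, skip
    script = title_script(title)
    if script == "Cyrillic":
        wanted = ["Serbian"]  # step 3
    elif script == "Latin":
        wanted = ["Bosnian", "Croatian", "Serbian"]  # step 4
    else:
        wanted = []
    return [lang for lang in wanted if lang not in present]
```

For an entry titled "kuća" that has only Serbo-Croatian and Croatian sections, missing_sections would report Bosnian and Serbian as candidates for restoration.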

The two passes would, in the implementation, be done together. It isn't clear how much damage has been done to translations; it might be sufficient to just create a list of entries to be manually sorted.
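
If only a worklist for manual sorting is wanted, something like the following might do. It is a sketch under assumptions: translation lines are taken to look like "* Serbo-Croatian: ...", the Translations header may sit at any level from 3 down, and the names has_sc_translation and manual_worklist are invented for this example.

```python
import re

# A "Translations" header at level 3 or deeper, e.g. "====Translations====".
TRANS_HEADER = re.compile(
    r"^={3,}[ \t]*Translations[ \t]*={3,}[ \t]*$", re.MULTILINE
)
# A translation line for Serbo-Croatian, with or without a wikilinked name.
SC_LINE = re.compile(
    r"^\*[ \t]*(?:\[\[)?Serbo-Croatian(?:\]\])?[ \t]*:", re.MULTILINE
)

def has_sc_translation(wikitext):
    """True if a Serbo-Croatian line follows the first Translations header."""
    header = TRANS_HEADER.search(wikitext)
    if header is None:
        return False
    return SC_LINE.search(wikitext, header.end()) is not None

def manual_worklist(entries):
    """`entries` yields (title, wikitext) pairs; return titles to review."""
    return sorted(title for title, text in entries if has_sc_translation(text))
```

The resulting list could be posted as a plain page of links for editors to work through by hand.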

The creation of new sections is entirely optional; it seems useful, but can be left out. The important part is to restore the language sections that were improperly deleted. This is very straightforward; the only true error case is restoring a section that was deleted because the word actually does not exist in the given language.
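
The restoration itself (steps 7 to 9) might look something like this. It is a minimal sketch, assuming the past revisions have already been fetched and are passed in newest first as plain wikitext strings; extract_section, latest_section and restore are hypothetical names, and the restored section is simply appended, leaving the re-sorting to AF as in step 19.

```python
import re

def extract_section(wikitext, language):
    """Return the level-2 section for `language`, or None if absent."""
    pattern = re.compile(
        r"^==[ \t]*" + re.escape(language) + r"[ \t]*==[ \t]*\n"  # header line
        r".*?"                       # section body, as short as possible
        r"(?=^==[^=]|\Z)",           # stop at the next level-2 header or end
        re.MULTILINE | re.DOTALL,
    )
    m = pattern.search(wikitext)
    return m.group(0).rstrip() + "\n" if m else None

def latest_section(revisions, language):
    """Return the section from the most recent revision that has it."""
    for text in revisions:  # assumed to be ordered newest first
        section = extract_section(text, language)
        if section is not None:
            return section
    return None

def restore(current, revisions, language):
    """Append the most recent old section if the current text lacks it."""
    if extract_section(current, language) is not None:
        return current  # already present, nothing to do
    section = latest_section(revisions, language)
    if section is None:
        return current  # never existed in the revisions examined
    return current.rstrip() + "\n\n" + section
```

Note that this sketch cannot tell a section removed in the merge from one removed because the word does not exist in that language; that case would still need a human check or a look at the edit summary.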

Generating a report (other than the standard logs) would probably be a very good idea.
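
One possible shape for such a report is a simple CSV with a row per modified entry. The column names and the write_report function below are only suggestions for illustration, not an agreed format.

```python
import csv
import io

def write_report(actions, out):
    """Write one row per entry: title, restored, created, attention flag."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["title", "restored", "created", "needs_attention"])
    for a in actions:
        writer.writerow([
            a["title"],
            " ".join(a.get("restored", [])),   # sections found in history
            " ".join(a.get("created", [])),    # sections newly generated
            "yes" if a.get("needs_attention") else "no",
        ])

# Example: collect the report in memory; a file object works the same way.
buf = io.StringIO()
write_report(
    [{"title": "kuća", "restored": ["Croatian"],
      "created": ["Bosnian", "Serbian"], "needs_attention": True}],
    buf,
)
```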