User:Keffy/Great Pronunciation Flood

Current status

 * Words transcribed: about 23,000
 * Latest word: packrat
 * Transcriptions uploaded: 19

as of March 26, 2006

Introduction
The Great Pronunciation Flood is a long-term project to automatically add consistent pronunciation entries to all Wiktionary articles, using data from off-line transcription tables that I have been building.

This page is to give an overview of my plans and solicit suggestions for improvement before I put too much more work into it. Until these plans get some sort of community consensus behind them, they will not happen.

Each added pronunciation entry will have:
 * an IPA transcription of a Canadian pronunciation (well, my Canadian pronunciation), using a consistent transcription system.
 * a link to an Ogg audio file on Wikimedia Commons of a speech synthesizer pronouncing the word. (I'm having Festival read the transcriptions aloud to me anyway as part of the proofreading process. Might as well save to files and upload while I'm at it.) Ideally, these synthesized files will gradually be replaced by files of live people with comparable pronunciations, but in the meantime, channelling Stephen Hawking is better than nothing.

The format of a pronunciation line will be: IPAc: where IPAc will be a link to the pronunciation key and a brief explanation (currently residing under my user page).

Example words that have a pronunciation entry in this style are:
 * amnesia
 * angst
 * electroencephalograph
 * biodiversity

Progress
(As of March 26:) Everything necessary for proofreading and synthesizing is now in place (including the MySQL database to keep track of which words have been proofed, which have been synthesized, and which have been uploaded to Commons and to Wiktionary). The Wiktionary bot has been written and reasonably thoroughly tested. Voting on a bot flag is now taking place in the Beer Parlour. Go vote.

Two thousand entries with audio are languishing on my hard drive, waiting to be adopted into a good home.

Bot design
There will be two bots.

The Commons bot (just me running the Commonist program from the Commons account User:KeffyBot) will upload the Ogg file of the synthesized pronunciation to Wikimedia Commons. Filenames will have the prefix "en-ca-synth-" [or whatever the Commons gurus tell me to use].

The Wiktionary bot will add the bulleted pronunciation entry as the last line of the existing third-level Pronunciation section, or create a third-level Pronunciation heading if none exists. The first pass will add only to words with no existing pronunciation entry.
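To make the insertion rule concrete, here is a minimal sketch in Python. It is not the actual bot code. The function name `add_pronunciation`, the placeholder entry text, and the choice to put a newly created Pronunciation section before the first third-level heading that is not an Etymology heading are all assumptions made for illustration.

```python
import re

# Matches a wiki heading such as "===Pronunciation===" and captures its level and title.
HEADING = re.compile(r"^(=+)[ \t]*(.*?)[ \t]*\1[ \t]*$")


def add_pronunciation(wikitext, entry):
    """Return wikitext with `entry` added to the English Pronunciation section.

    Hypothetical helper: the entry becomes the last line of an existing
    third-level Pronunciation section, or a new section is created.
    """
    lines = wikitext.split("\n")

    # Collect (line index, level, title) for every heading on the page.
    heads = []
    for i, text in enumerate(lines):
        m = HEADING.match(text)
        if m:
            heads.append((i, len(m.group(1)), m.group(2)))

    start = next((i for i, lvl, t in heads if lvl == 2 and t == "English"), None)
    if start is None:
        raise ValueError("no second-level English heading")
    # The English section ends at the next second-level heading, or at end of page.
    end = next((i for i, lvl, _ in heads if lvl == 2 and i > start), len(lines))
    inner = [h for h in heads if start < h[0] < end]

    pron = next((i for i, lvl, t in inner if lvl == 3 and t == "Pronunciation"), None)
    if pron is not None:
        # The section body runs up to the next heading of any level.
        stop = next((i for i, _, _ in inner if i > pron), end)
        # Step back over trailing blank lines so the entry is the last real line.
        while stop > pron + 1 and not lines[stop - 1].strip():
            stop -= 1
        lines.insert(stop, entry)
    else:
        # Assumption: a new section goes before the first non-Etymology
        # third-level heading, or at the end of the English section.
        at = next(
            (i for i, lvl, t in inner if lvl == 3 and not t.startswith("Etymology")),
            end,
        )
        lines[at:at] = ["===Pronunciation===", entry, ""]
    return "\n".join(lines)


if __name__ == "__main__":
    page = "==English==\n===Etymology===\nFrom German.\n\n===Noun===\nangst"
    print(add_pronunciation(page, "* (pronunciation line goes here)"))
```

Working on a list of lines rather than on one long string keeps the "last line of the section" rule easy to check by eye.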

Articles that look like they're too complicated or too dysfunctional for automatic processing will be skipped and processed by hand later. Criteria for skipping an article will include:
 * no second-level English heading
 * two third-level Pronunciation headings
 * two third-level Etymology headings
 * a fourth-level Pronunciation heading
 * ... add any other conditions you can think of here ...
plus any word that I've tagged as having multiple pronunciations depending on part of speech (e.g., noun vs. verb convert).
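The skip criteria could be checked along these lines. This is only a sketch, not the bot's real logic. The name `skip_reasons`, the `multi_pron_words` set, counting headings across the whole page rather than only within the English section, and treating "Etymology 1", "Etymology 2" and so on as Etymology headings are assumptions.

```python
import re

# Matches a wiki heading such as "===Etymology 1===" and captures its level and title.
HEADING = re.compile(r"^(=+)[ \t]*(.*?)[ \t]*\1[ \t]*$")


def skip_reasons(wikitext, word, multi_pron_words=frozenset()):
    """Return the reasons an article should be left for hand processing.

    An empty list means the article looks safe for automatic processing.
    `multi_pron_words` is a hypothetical set of words tagged as having
    different pronunciations depending on part of speech.
    """
    heads = []
    for line in wikitext.splitlines():
        m = HEADING.match(line)
        if m:
            heads.append((len(m.group(1)), m.group(2)))

    reasons = []
    if (2, "English") not in heads:
        reasons.append("no second-level English heading")
    if heads.count((3, "Pronunciation")) >= 2:
        reasons.append("two third-level Pronunciation headings")
    etymologies = [t for lvl, t in heads if lvl == 3 and t.startswith("Etymology")]
    if len(etymologies) >= 2:
        reasons.append("two third-level Etymology headings")
    if (4, "Pronunciation") in heads:
        reasons.append("a fourth-level Pronunciation heading")
    if word in multi_pron_words:
        reasons.append("multiple pronunciations depending on part of speech")
    return reasons


if __name__ == "__main__":
    page = "==English==\n===Etymology 1===\n====Pronunciation====\n===Etymology 2===\n"
    for reason in skip_reasons(page, "convert", {"convert"}):
        print("skip:", reason)
```

Returning the full list of reasons, rather than a bare yes or no, makes it easier to sort the skipped articles for hand processing later.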

After the initial backlog is cleared, the bot will add a couple of hundred pronunciations per night.

IPAc key
Each transcription consistently uses the following symbols for the sounds of Canadian English.

Other symbols:

(Table cribbed mostly from Pronunciation guide, with brutal deletions.)

A few brief notes on the dialect transcribed:
 * There is no contrast between [ɑ], [ɒ], [ɑː], and [ɔ]. Cot and caught are both [kɑt].
 * There is no contrast between [w] and [ʍ]. Witch and which are both [wɪtʃ].
 * There is no [ær]; it has merged into [ɛr]. Mary, merry, and marry are all [ˈmɛri].
 * There is no contrast between [ə] and unstressed [ɪ] or [ɨ]. Not all speakers have such a contrast (more to the point, I don't).  Even speakers who do have a contrast can't agree on which word has which.
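As a proofreading aid, the notes above could be turned into a simple check that flags symbols which should never appear in a transcription of this dialect. This is a rough sketch and not part of the existing tooling. The name `merged_symbols_in` and the tuple of flagged strings are assumptions. [ɔ] and [ɪ] are deliberately left out, because the key may still use them in other contexts and a plain substring test cannot tell stressed from unstressed vowels.

```python
# Symbols or sequences that the dialect notes say are merged away.
# Assumption: this tuple is illustrative, not the project's real rule set.
MERGED_AWAY = (
    "ɒ",   # merged with [ɑ]: cot and caught are both [kɑt]
    "ɑː",  # merged with [ɑ]
    "ʍ",   # merged with [w]: witch and which are both [wɪtʃ]
    "ær",  # merged into [ɛr]: Mary, merry, marry are all [ˈmɛri]
    "ɨ",   # no contrast with [ə]
)


def merged_symbols_in(transcription):
    """Return the flagged symbols that occur in an IPA transcription string."""
    return [symbol for symbol in MERGED_AWAY if symbol in transcription]


if __name__ == "__main__":
    for ipa in ("kɑt", "kɒt", "ʍɪtʃ", "ˈmæri"):
        print(ipa, merged_symbols_in(ipa))
```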