User:OrenBochman/BetterSearch

Problem: Lucene search processes the Wikimedia wikitext source, not the rendered HTML output.

 * 1) (Also) index the rendered HTML output?

Problem: The rendered HTML also contains CSS, scripts, and comments.
Either index these too, or run a filter to remove them. Some strategies (also interesting if one wants to compress output for integration into a DB or cache):
 * 1) Discard all markup.
  * a markup filter/tokenizer could be used to discard markup.
  * the Apache Tika project (originally a Lucene subproject) can do this.
  * other ready-made solutions exist.
 * 2) Keep all markup.
  * write a markup analyzer that would also compress the page to reduce storage requirements.
 * 3) Selective processing.
  * a table/template map extension could be used to identify structured information for deeper indexing.
  * this is the most promising: it can also detect and filter out unapproved markup (JavaScript, CSS, broken XHTML).
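As a minimal sketch of the "discard all markup" strategy (in production one would more likely reach for Apache Tika, as noted above), the standard-library HTML parser can drop tags, `<script>`/`<style>` bodies, and comments, keeping only indexable text:

```python
# Minimal markup filter: keeps visible text, drops tags, script/style
# bodies, and comments. A sketch only, not a production-grade cleaner.
from html.parser import HTMLParser

class MarkupStripper(HTMLParser):
    """Collects visible text, skipping <script> and <style> content."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # > 0 while inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

    # HTML comments are ignored by the base class's no-op handle_comment.

def strip_markup(html: str) -> str:
    p = MarkupStripper()
    p.feed(html)
    return " ".join(" ".join(p.parts).split())  # normalize whitespace
```

For example, `strip_markup('<p>Hello <b>world</b></p><script>var x=1;</script><!-- note -->')` yields `'Hello world'`.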

Problem: Indexing offline and online

 * 1) real-time only: slowly build the index in the background.
 * 2) offline only: use a dedicated machine/cloud to dump and index offline.
 * 3) dual: each time the linguistic component becomes significantly better (or there is a bug fix), it would be desirable to upgrade the search index. How this would be done depends largely on the analyzer's architecture. One possible approach, triggered by the production of new linguistic/entity data or a new software milestone:
  * offline analysis from a dump (XML or HTML).
  * online processing of updates, newest to oldest, with a Poisson wait-time prediction model.
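One way to read the "Poisson wait time prediction" idea is to model each page's edits as a Poisson process, estimate its rate from recent history, and re-index first the pages least likely to be edited again soon (so the fresh index entry stays valid longest). The scheduling policy below is an assumption, not something the notes specify:

```python
# Hedged sketch: prioritize re-indexing by predicted edit activity.
# Timestamps are in hours; all names here are illustrative.
import math

def edit_rate(edit_timestamps, now):
    """Estimated edits per hour over the observed window."""
    if not edit_timestamps:
        return 0.0
    window = max(now - min(edit_timestamps), 1.0)  # avoid division by zero
    return len(edit_timestamps) / window

def p_edit_within(rate, hours):
    """P(at least one edit in the next `hours`) under Poisson(rate)."""
    return 1.0 - math.exp(-rate * hours)

def reindex_order(pages, now, horizon=24.0):
    """pages: {title: [edit timestamps]}; most stable pages first."""
    return sorted(pages,
                  key=lambda t: p_edit_within(edit_rate(pages[t], now), horizon))
```

With `pages = {"Stable": [0.0], "Busy": [90.0, 95.0, 99.0]}` and `now = 100.0`, `reindex_order` puts `"Stable"` before `"Busy"`: the busy page is very likely to be edited again within the horizon, so indexing it first would be wasted work.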

Problem: Lucene's best analyzers are language specific

 * 1) The N-gram analyzer is language independent.
 * 2) A new multilingual analyzer with a language detector could be produced by extracting features from the query and checking them against a model prepared offline. The model would contain lexical features such as:
  * the alphabet.
  * bigram/trigram distribution.
  * stop lists: collections of common word/POS/language sets (or lemma/language pairs).
  * normalized frequency statistics based on sampling full text from different languages.
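The detector sketched above can be approximated in a few lines: build per-language character-trigram profiles offline, then score a query against each profile. The tiny inline "corpora" below are hypothetical stand-ins for real sampled full text:

```python
# Trigram-based language detection sketch. Real profiles would be built
# offline from large dumps; the two mini-corpora here are placeholders.
from collections import Counter

def trigrams(text):
    t = f"  {text.lower()}  "  # pad so word boundaries become features
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

def similarity(a, b):
    """Cosine similarity between two trigram count vectors."""
    dot = sum(a[g] * b[g] for g in set(a) & set(b))
    norm = (sum(v * v for v in a.values()) ** 0.5) * \
           (sum(v * v for v in b.values()) ** 0.5)
    return dot / norm if norm else 0.0

def detect(query, profiles):
    """profiles: {lang: Counter of trigrams from sampled full text}."""
    return max(profiles, key=lambda lang: similarity(trigrams(query), profiles[lang]))

profiles = {
    "en": trigrams("the quick brown fox jumps over the lazy dog and the cat"),
    "de": trigrams("der schnelle braune fuchs springt über den faulen hund und die katze"),
}
```

For instance, `detect("the cat and the dog", profiles)` returns `"en"`. The same framework extends naturally to the other proposed features (alphabet, stop lists, frequency statistics) as additional scored signals.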

Problem: Search is not aware of morphological language variation

 * 1) In languages with rich morphology (e.g. Hebrew, Arabic) this reduces the effectiveness of search.
 * 2) Index Wiktionary so as to produce data for a "lemma analyzer":
  * dumb lemma: a bag of forms with a representative.
  * smart lemma: a list of forms ordered by frequency.
  * quantum lemma: forms organized by morphological state and frequency.
 * 3) lemma-based indexing.
 * 4) run a semantic disambiguation algorithm to tag and disambiguate forms.
 * other benefits:
 * 1) lemma-based compression (arithmetic coding based on the smart lemma).
 * 2) indexing all lemmas.
 * 3) smart resolution of disambiguation pages.
 * 4) an algorithm to translate English into Simple English.
 * 5) excellent language detection for search.
 * metrics:
 * 1) extract the amount of information contributed by a user:
  * since inception.
  * in the final version.
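To make the "dumb lemma" idea concrete: every surface form in a lemma's bag maps to one representative, so a search for any form also matches documents containing the others. The lemma table would come from the proposed Wiktionary indexing; the entries below are hypothetical English examples:

```python
# Sketch of dumb-lemma (bag with a representative) indexing and search.
# LEMMA_BAGS is a hypothetical stand-in for Wiktionary-derived data.
from collections import defaultdict

LEMMA_BAGS = {
    "run": {"run", "runs", "ran", "running"},
    "go":  {"go", "goes", "went", "gone", "going"},
}

# Invert to a surface-form -> representative lookup, used at both
# index time and query time so the two sides agree.
FORM_TO_LEMMA = {form: lemma
                 for lemma, bag in LEMMA_BAGS.items() for form in bag}

def lemmatize(token):
    return FORM_TO_LEMMA.get(token.lower(), token.lower())

def index(docs):
    """docs: {doc_id: text}; returns an inverted index keyed by lemma."""
    inv = defaultdict(set)
    for doc_id, text in docs.items():
        for tok in text.split():
            inv[lemmatize(tok)].add(doc_id)
    return inv

def search(inv, query):
    """All documents containing every query term, up to lemmatization."""
    return set.intersection(*(inv.get(lemmatize(t), set())
                              for t in query.split()))
```

For example, after `inv = index({"d1": "she went running", "d2": "he goes home"})`, a query for `"go"` matches both documents, because "went" and "goes" share the representative "go". A smart lemma would additionally order each bag by frequency, which is what makes the compression and translation ideas above possible.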

Developer/Admin Information

 * MediaWiki manual
 * Extensions

Search Options
highlights:
 * Search Extensions
 * Extension:MWSearch
 * Lucene Search
 * Extension:EzMwLucene
 * Extension:SphinxSearch

Potential Contact People
commit-capable developers, irc:#mediawiki

Screened

 * Brion Vibber - lead dev
 * Multichill
 * Andrew Garrett - active paid developer
 * Roan Kattouw - usability initiative, previously lead developer and maintainer of the MediaWiki API.
 * Siebrand Mazeland

Unscreened

 * Seb35
 * Aryeh Gregor
 * David Richfield
 * Niklas Laxström - experienced MediaWiki developer

Misc

 * ZIM offline format