Wiktionary:Corpora

This page is dedicated to listing collections of texts useful for the work of creating a dictionary. These collections are often known as "corpora" or less commonly "corpuses". Many of them feature functions like full-text search, term frequency information and collocation search.

For a more user-friendly introduction to some of the most prominent corpora, as well as other resources like dictionaries, see Quotations/Resources. Another page, Searchable external archives, also contains information, with a more specific focus on resources that can reliably provide citations passing Wiktionary's criteria for inclusion.

Note that corpora that contain text in multiple languages, but in which English text makes up a significant portion, are listed in the English table below, with the "Dialect" field of the listing including the word "Multilingual".

If there are any other resources that you know of which aren't listed here, please do add them or suggest them on the talk page.

English

Non-English

Glossary

The following is a brief explanation of how various terms are used in describing and categorizing the corpora on this page.


 * Access restrictions: Any barriers to accessing the resource's contents, such as required registration or a paid subscription. A number of resources can be accessed through the Wikipedia Library for free.
 * Apprx.: "Approximately", used to indicate that a date or quantity was not exactly known when entered, so a best estimate is given.
 * Available medium: The format through which the language can be accessed in the resource, such as written text or speech recorded in a video.
 * Esp.: "Especially", used to qualify the most common quality of a corpus, even if there are notable exceptions.
 * Hyphen (-): The symbol "-" is used in tables for information about a corpus that cannot be readily determined or approximated.
 * Library: Collection of texts gathered with a wide net and without linguistics work particularly in mind. It must be possible to search the contents of these texts.
 * Original Medium: The way the language was originally produced, whether spoken, written, etc.
 * Question mark (?): The symbol "?" is used in tables for information about a corpus that has not yet been determined, but probably could be.
 * Social media: A live website or other online center for mass user communication, or an attempt at a near-complete archive of one. If the resource is an archive with a particular focus, then it is considered a library or corpus.
 * Re-use restrictions: Unique restrictions on the distribution of the resource's contents beyond general copyright law, in particular restrictions on commercial use or restrictions limiting use to academic users. This restriction is particularly relevant to Wiktionary, where all content must be able to be redistributed commercially per Wiktionary's CC BY-SA 4.0 license.
 * Strikethrough: Resources whose names are crossed out with a strikethrough were nonfunctional or otherwise broken at the time of the entry's last update.
 * Tagged Corpus: Collection of texts gathered within a specific scope with linguistics work at least partly in mind. The contents of the texts are marked by part of speech, meaning, pragmatics, or any other method.
 * Text: A continuous use of language published, released, or spoken as a coherent work. This could be a forum post in a thread, a book, an issue of a magazine, or a speech.
 * Untagged Corpus: Collection of texts gathered within a specific scope with linguistics work at least partly in mind. The contents of the texts are not marked by part of speech, meaning, pragmatics, or any other method.

Other lists and databases
See also Searchable external archives and Quotations/Resources.