User:Dan Polansky/Thesaurus

Layout
Keeping it simple (see also Thesaurus/Format):

Hyponyms
...

Formatting of lists
Possible formatting of lists in Thesaurus entries:

1. Comma-separated list of Thesaurus:arrogance:

,, , , , , , , , , , , , ,

2. Bullet formatting of Thesaurus:arrogance, with three columns:



3. Tabular formatting of Thesaurus:arrogance:

Benefits over the mainspace
There are several benefits of having a dedicated namespace for a thesaurus rather than having all thesaurus information in the mainspace, in the sections of synonyms, antonyms, hyponyms, hypernyms, meronyms, holonyms, instances, and see also.
 * 1) First, the current design of Thesaurus enables and tends toward one sense per entry, so the reader and the editor can focus on one location and its surroundings in the imaginary space of semantics. By contrast and example, "cat" now has a synonym section that can contain synonyms for an animal, a man, a spiteful woman, a prostitute, a vulva, and more. Furthermore, while the synonyms of a sense are grouped together in the synonyms section, all the terms semantically related to the sense are scattered across the various sections for semantic relations, interlaced with the other senses.
 * 2) Second, a dedicated namespace enables focus on semantic relations to the exclusion of everything else, including etymology, pronunciation, definitions, example sentences, related terms, and derived terms, making it easier for the editor to use his imagination and memory to expand the entry.
 * 3) Third, the use (in each Thesaurus entry) of the little "[WS]" next to each word that links to another Thesaurus entry makes it much easier to browse the semantic network of words; in the mainspace, the reader does not know which listed items in synonyms, hyponyms and other sections lead to another interesting cluster, and which items lead only to an entry that does not lead anywhere. It is unclear how a similar thing could be implemented in the mainspace, other than manually tagging each link to another cluster.
 * 4) Fourth, if long lists of hyponyms are placed into the mainspace, editors will want to put them into a collapsible box so as to make it easier to scroll to the translations section located further below. However, these collapsible boxes make browsing the semantic network even slower, as the reader first has to expand several sections for semantic relations to see them all, often including hyponyms, meronyms, and holonyms, or even more sections if synonyms and hypernyms are also placed inside collapsible boxes. Each of the mentioned sections has to host several senses in the mainspace, further reducing the ease of browsing and user concentration, as mentioned in the first point.
 * 5) Fifth, keeping a list of synonyms (and hyponyms and other terms) at one place simplifies keeping the list up to date; in the mainspace, the lists of synonyms of a set of synonyms are scattered across several pages, and often soon diverge. This advantage of a dedicated namespace could be reduced or even removed by placing the likes of "See also miser " into the synonym sections of the synonyms of "miser", given "miser" is the main entry for the collection of synonyms. Nonetheless, the hyperlink would still only unreliably lead to the desired section, as one page often has several synonym sections. Another approach that could reduce this advantage would be the use of templates that contain the lists and get included in the respective sections, but the templates would have to be separate for synonyms, hyponyms, and other relations so as to be includable in the separate sections.
 * 6) Sixth, as a smaller editing advantage, moving semantic clusters around is much easier in a dedicated namespace: an editor just uses the move function of the wiki software.  In the mainspace, the revision history of an entry is tied to that entry's headword, and moving semantic clusters around is achieved only by copying the data without copying the history.  Admittedly, some editing operations in the dedicated namespace also involve copying of text without editing history, such as splits of pages.
 * 7) Seventh, a dedicated namespace makes semantically sum-of-partish headwords possible, such as "marijuana cigarette", "alcoholic beverage", or "beautiful woman". This advantage could be reduced by allowing some sum-of-partish headwords in the mainspace for the sake of the thesaurus.

Examples

 * - an entry that has its own major sense, and another sense just redirecting to.

Design
Design principles:
 * Only a fraction of all the Wiktionary terms should have their own Thesaurus headword. An example: There should be Thesaurus:cat, but no Thesaurus:kitty. A model: Roget thesaurus.
 * Words should be grouped under shortened definitions of senses, not under words. Having the definition explicitly stated in the L4 heading using " ======== " is key to accuracy.
 * A Thesaurus entry should be linked to from the mainspace of each of its listed synonyms. An example: The mainspace entry of "kitty" should link to "Thesaurus:cat" in the "====Synonyms====" section, using the text "See also Thesaurus:cat". It is "See also" and not "See" because Wiktionary mainspace should keep the synonyms rather than completely outsourcing the task to Thesaurus. Wiktionary's Synonyms section is the key point of entry to Thesaurus.
 * Included semantic relationships: synonymy, antonymy, hyponymy, hypernymy, meronymy, holonymy.
 * The "See also" section can contain links to other Thesaurus entries, and links to categories.
 * Each Thesaurus entry should ideally host exactly one sense, of one part of speech; see also . Currently, many entries host several senses on the same headword, such as the entry "Thesaurus:sound". A solution considered in the past: having headwords like "fat (obese)". This solution can sometimes be avoided by choosing less ambiguous headwords, but whether this is going to be feasible remains to be seen.

Cluster thesaurus
I am pushing for Thesaurus to be set up as a cluster thesaurus (such as the original Thesaurus of English Words and Phrases by Peter Mark Roget), rather than a list-per-each-word thesaurus (such as Moby II). There were dictionaries of synonyms before Roget, but none of them made such a splash as Roget did. Roget's thesaurus is not a dictionary of synonyms: within one category, terms are lumped together by various semantic relations that are not synonymy. So in the category or cluster 366 "Animal", we find a list for "mammal", "quadruped", "bird", "reptile", "fish", "mollusk", "worm", "insect", etc., all of which are hyponyms of "animal" and none of which are synonyms of each other. In fact, given that Roget's thesaurus is a paradigm case of a thesaurus as a word finder, the definition of a thesaurus as a dictionary of synonyms is rather inaccurate.

A cluster thesaurus, as I use the term here, is one in which most words do not have a dedicated entry. Instead, words are grouped under major senses: if there is an entry for "stingy", there does not need to be an entry for "miserly". Also, many entries are dominated by hyponymy rather than synonymy, such as the entry for "animal".
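
The way many words share one entry in a cluster thesaurus can be sketched in Python. This is only an illustration under assumptions: the names CLUSTERS, WORD_TO_ENTRY and find_entry are hypothetical, and the word lists are toy samples rather than actual Thesaurus data.

```python
# Hypothetical toy data: one entry per major sense, each hosting many words.
CLUSTERS = {
    "stingy": ["stingy", "miserly", "parsimonious", "tight-fisted"],
    "animal": ["animal", "mammal", "bird", "reptile", "fish", "insect"],
}

# Reverse index: each word points to the headword of the entry hosting it.
WORD_TO_ENTRY = {
    word: headword
    for headword, words in CLUSTERS.items()
    for word in words
}


def find_entry(word):
    """Return the headword of the entry listing the word, or None."""
    return WORD_TO_ENTRY.get(word)


print(find_entry("miserly"))  # stingy
print(find_entry("mammal"))   # animal
```

In this sketch "miserly" has no entry of its own; a lookup lands on the entry for "stingy", and "mammal" lands on the hyponymy-dominated entry for "animal".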

When a thesaurus is designed as list-per-each-word thesaurus, and does not even distinguish senses within the list, words can be hard to find. A particularly striking example is the entry "work" in Moby II with its over 1000 words mixed by various senses of "work".

Semantic relations
Hyponymy. Hyponymy in nouns captures a subclass relationship. If each individual that is a rabbit is also an animal, then "rabbit" is a hyponym of "animal". As not every animal is a rabbit, "rabbit" is more specific than "animal". Hyponymy should not be confused with instance-of relationship: Jupiter is a planet, and "Jupiter" is not a hyponym of "planet".

Meronymy. Meronymy in nouns captures any of various part-of relationships. If each atom of a finger is also an atom of the hand the finger is part of, "finger" is a meronym of "hand". If each sentence belonging to grammar is also a sentence of linguistics, then "grammar" is a meronym of "linguistics". Other examples include branch and tree, person and group of people, and wheel and car.

Although both hyponymy and meronymy are thus defined using the set-theoretic subset relationship, hyponymy considers sets of individuals while meronymy considers sets of further undivided constituents of individuals--atoms. The atoms are chosen in part arbitrarily, and are in principle potentially further divisible; the role of the atom can be played by a molecule, by a physical atom, by an elementary particle or any other part that is small enough. While meronymy between wheel and car can be analysed in terms of molecules, meronymy between elementary particle and physical atom can be analysed in terms of quarks.
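
The subset reading of both relations can be sketched with Python sets. The individuals and atoms below are made-up placeholders chosen for illustration, and the helper is_subset is a hypothetical name; the only point is that the same subset test is applied to sets of individuals for hyponymy and to sets of atoms for meronymy.

```python
# Hyponymy compares sets of individuals (hypothetical toy individuals).
rabbits = {"rabbit_1", "rabbit_2"}
animals = {"rabbit_1", "rabbit_2", "cat_1"}

# Meronymy compares sets of atoms of one finger and of the hand it is part of.
finger_atoms = {"atom_1", "atom_2"}
hand_atoms = {"atom_1", "atom_2", "atom_3", "atom_4"}


def is_subset(part, whole):
    """True if every element of part is also an element of whole."""
    return part <= whole


is_hyponym = is_subset(rabbits, animals)          # "rabbit" vs "animal"
is_meronym = is_subset(finger_atoms, hand_atoms)  # "finger" vs "hand"
reverse_holds = is_subset(animals, rabbits)       # not every animal is a rabbit

print(is_hyponym, is_meronym, reverse_holds)  # True True False
```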

Hypernymy. Hypernymy is the inverse of hyponymy: if "rabbit" is a hyponym of "mammal", then "mammal" is a hypernym of "rabbit".

Holonymy. Holonymy is the inverse of meronymy: if "branch" is a meronym of "tree", then "tree" is a holonym of "branch".
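
That hypernymy and holonymy are inverses means they need not be stored separately; they can be derived by swapping pairs. A minimal Python sketch, in which the function name invert and the pair sets are assumptions made for illustration:

```python
def invert(pairs):
    """Swap each (a, b) pair to (b, a), yielding the inverse relation."""
    return {(b, a) for a, b in pairs}


hyponym_pairs = {("rabbit", "mammal")}  # "rabbit" is a hyponym of "mammal"
meronym_pairs = {("branch", "tree")}    # "branch" is a meronym of "tree"

hypernym_pairs = invert(hyponym_pairs)  # "mammal" is a hypernym of "rabbit"
holonym_pairs = invert(meronym_pairs)   # "tree" is a holonym of "branch"

print(hypernym_pairs)  # {('mammal', 'rabbit')}
print(holonym_pairs)   # {('tree', 'branch')}
```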

Being an instance. Being an instance of is distinct from being a hyponym. Mars is a planet in that it is an instance of a planet: it is a particular individual physical object that can be classed under the head of "planet". By contrast, each planet is a heavenly body, so "planet" is a hyponym of "heavenly body"; the term planet does not denote a particular individual unless further extended such as "the heaviest planet". Set-theoretically, being an instance corresponds to the set membership rather than subset relationship.
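
The contrast between set membership and the subset relationship can be sketched in Python as follows. The sets are incomplete toy samples assumed for illustration, not a full listing of planets or heavenly bodies.

```python
# Hypothetical, incomplete toy sets of individual objects.
planets = {"Mars", "Jupiter"}
heavenly_bodies = {"Mars", "Jupiter", "Sun", "Moon"}

# Being an instance: the individual Mars is an element of the set of planets.
mars_is_instance = "Mars" in planets

# Hyponymy: the set of planets is a subset of the set of heavenly bodies.
planet_is_hyponym = planets <= heavenly_bodies

print(mars_is_instance, planet_is_hyponym)  # True True
```

The first test takes an individual on the left, the second takes a set; this is the sense in which "Mars" is an instance of "planet" while "planet" is a hyponym of "heavenly body".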

Various. The head "Various" is meant for a broad set of further semantic relationships. One of them is near-synonymy: acetaminophen is not exactly an NSAID, but it shares some properties with NSAIDs and is likely grouped together with NSAIDs in the naive classification in the mind of the reader, so is well listed in the entry for NSAID, albeit not in the synonyms section. Domains of study such as physics talk in and study key concepts such as mass, velocity and energy; the listing of key concepts in a Thesaurus entry for a domain seems worthwhile. A property noun such as Thesaurus:size can list the adjectives that describe that property, such as "big" and "small". A verb that refers to an activity such as Thesaurus:make clean can list materials and equipment useful or needed for the activity, such as "mop" and "vacuum cleaner". The selection of what to include and what to omit needs to be done in part arbitrarily, based on an assessment of the commonality and relevance of the term considered for inclusion. The head "Various" is contradistinguished from the head "See also" by including terms that are not necessarily included in Thesaurus as a dedicated headword; the section "See also" should only link to other Thesaurus entries.

Containment. The containment relationship is so far unsupported in Thesaurus. A cup of tea contains tea, but it does not consist of tea, so there is no meronymy between a cup of tea and tea, unlike between the cup and its handle; at least by one account. A river contains water but whether water is part of a river seems unclear. Blood vessels contain blood, but blood is not part of blood vessels. An apartment contains furniture, but it is unclear whether furniture is part of the apartment. It is usual for the content of a container to be contained only for a part of the lifetime of the container: the tea is drunk from a cup, a river can dry and thus lose its water, furniture in an apartment can be exchanged for new furniture, and blood circulates in the blood vessels, so a particular blood vessel contains different (non-self-same) blood cells at distinct time points. Whether blood consists of blood cells or contains blood cells is unclear; a pond likely contains fish rather than consisting of them. That said, containment could be marked as meronymy or as "various": it seems meaningful to navigate from blood vessel to blood, and it makes a bit of sense to navigate from cup to tea and milk, and from glass to wine, that is, from a drinking vessel to beverages. Experience can show that what seems impractical upon contemplation or speculation turns out to work quite well in practice; or not.

Troponymy. Troponymy is a specialization of hyponymy for verbs. Given that "to guttle" is more specific than "to eat", "to guttle" is a troponym of "to eat". The opposite of "troponym" is called "hypernym" by WordNet, a project that uses the term "troponym". I recommend dispensing with the term "troponym" altogether, using the term "hyponym" instead. Each verb has a corresponding noun that denotes the verb's action, and the relation between the nouns is called hyponymy anyway: "guttling" is a hyponym of "eating". There is no dedicated term for hyponymy for adjectives as opposed to nouns, so there does not need to be a dedicated term for verbs either, especially given the dedicated term has no dedicated antonym but uses the generic "hypernym".

Number of senses per entry
The current setup of Thesaurus enables several senses per Thesaurus entry. But I think there should be as few senses per Thesaurus entry as possible, ideally only one sense per entry. A Thesaurus entry does not stand for a term or a syntactic entity; it stands for a semantic entity or a semantic cluster. Thus, etymology and etymologically related terms make no sense in Thesaurus, as the headword is not really a proxy for itself but only for a sense or a concept. It is confusing to store several semantic clusters on one headword. Doing so facilitates the idea that each term from mainspace should have its own Thesaurus entry, which is a really poor design for many purposes I think. Doing so also confuses the hyponymy and meronymy relations, as they typically refer to senses through their headwords, and when more senses are on the same headword, the headword cannot be used to unambiguously refer to one sense. Referring to Thesaurus senses through their headwords from the mainspace is ambiguous when there are more senses in one Thesaurus entry.

Particularly many senses can be found in the entry "Thesaurus:sound". It now has 9 senses, whereas I think Thesaurus:sound should ideally host only one sense: a sensation perceived by the ear. The sense of sound as a body of water can be hosted in Thesaurus:inlet instead; the sense of sound as healthy or fit can be hosted in Thesaurus:healthy or Thesaurus:fit; etc. Other entries with many senses are Thesaurus:bad and Thesaurus:old.

WordNet is a model that has senses as central entities rather than having spellings or terms as central entities. I encourage setting up Thesaurus on the same principle: the central entity is a sense or a sense cluster, and that entity is hosted on a headword, but the headword is an accident of sorts: the cluster for child as a non-adult person can be hosted on "child" or "kid", and it does not really matter all that much on which of the two it is hosted. So the headword "Thesaurus:obvious" does not stand for "obvious" more than it stands for "apparent", "manifest", "evident", "palpable", etc. Thesaurus headwords could be even numbered, like "Thesaurus:sense267", but using natural-language headwords looks more natural and helps human memory. The downside of natural-language headwords is that they get confused with the subject of the entry, which they are not; it is the sense that is the subject of the entry, not the headword.

The requirement of one sense per headword raises the question of whether an appropriate headword can always be found. So far it seems feasible to find a dedicated headword for each sense, without the need for disambiguating terms in the headword. If this approach turns out to create more difficulties than expected, Thesaurus can start using headwords on the model of "rich (wealthy)", "rich, wealthy", "rich; wealthy" or the like. The need for this complication has as yet not been demonstrated.

Currently, there are over 70 Thesaurus entries with more than one sense.

Granularity
Thesaurus should IMHO refrain from fine granularity. Thus, it should have WS:way but not WS:pathway; all the hyponyms of "way" in the sense of "place for passing from a place to another place" can be hosted on one page. Likewise, it should have WS:child but not WS:baby, WS:toddler, and Thesaurus:preteen. Nothing is helped by having many small pages with 5 to 7 items rather than having larger pages with 50 or more items, with groups of hyponyms separated by.

Project name
The current project name is "Thesaurus", based on the name "thesaurus" given by Peter Mark Roget to his classical word finder. Previously, the project was called Wikisaurus, which many people on the web understood to refer to an extinct reptile such as a dinosaur. An alternative name that comes to mind is "Wikinyms", referring to synonyms, antonyms, hyponyms, meronyms and other *nyms.