Wiktionary talk:Statistics

"good" and "bad" entries
Can we come up with different adjectives here? I think "good" and "bad" may be significantly misleading to the uninitiated. I had proposed "interesting" and "uninteresting", but Dangherous didn't like them. —scs 00:08, 30 June 2006 (UTC)


 * How about describing them as "entries with wikilinks" and "entries without wikilinks".
 * Also, how about removing the "mostly redirects" comment. It used to be true, but I'm not so sure that it is, now.  --Connel MacKenzie T C 05:36, 30 June 2006 (UTC)


 * It's not about wikilinks... is it? They are mostly redirects I guess, but up to now I haven't found any decent description of what is considered an entry and what not. Since we do have a huge amount of redirects, I expect them to be the majority. Now, "good" and "bad" are the terms that have always been used. Never mind, though, it's just a detail. Call them "empyreal" and "purgatorial" if you like. — Vildricianus 08:46, 30 June 2006 (UTC)

scrunch up a little
For all namespaces after NS:1=Talk, why not just combine two rows into one, and add a "Talk" column for them? (I'm tempted to suggest that all except the subtotals should have Show/hide type auto-hiding.) --Connel MacKenzie T C 05:38, 30 June 2006 (UTC)
 * The show/hide crap will make it complicated, but feel free to play around with the table. — Vildricianus 08:46, 30 June 2006 (UTC)

Take after the French
This page would be more interesting/helpful if it contained more information in an easier to use format, such as fr:Wiktionnaire:Statistiques. Jade Knight 19:43, 23 October 2006 (UTC)

Spanish and English statistics
Curiouser and curiouser... we now have more Spanish words than English ones. Beobach972 03:15, 24 October 2006 (UTC)
 * Right - on this iteration I did not exclude the "form of" templates. --Connel MacKenzie 03:16, 24 October 2006 (UTC)

Detail
I'm surprised to have gotten so little feedback on the "Detail" section. Perhaps the explanation is clear enough? Honestly, I expected somebody to ask why the numbers (say, for English) don't add up to get "Total definitions." (The answer is that "real definitions" is exclusive of the others, but something can count as an "inflected form" and as "slang" while actually being only one definition line.) I also kind of expected someone to ask why "total language sections" is so much higher than "real definitions" and so very much lower than "total definitions." I guess that is self-evident? --Connel MacKenzie 20:48, 23 May 2007 (UTC)
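
The overlap described above can be illustrated with a small sketch. The sample lines, the tag names, and the `count_categories` helper are hypothetical and are not taken from the script that generates the page. The sketch only shows why per-category counts can add up to more than the number of definition lines when one line carries several tags.

```python
from collections import Counter

# Hypothetical definition lines, each with the set of special categories it
# falls into. A line with no special category is treated here as a
# "real definition", which makes that category exclusive of the others.
definition_lines = [
    {"text": "A domesticated feline.", "tags": set()},
    {"text": "plural of cat", "tags": {"inflected form"}},
    {"text": "(slang) plural of dude", "tags": {"inflected form", "slang"}},
]


def count_categories(lines):
    """Count how many definition lines fall into each category."""
    counts = Counter()
    for line in lines:
        if line["tags"]:
            # One line may be counted under several categories at once.
            counts.update(line["tags"])
        else:
            counts["real definition"] += 1
    return counts


counts = count_categories(definition_lines)
total_definitions = len(definition_lines)
```

Here the third line is counted both as an "inflected form" and as "slang", so the category counts sum to more than `total_definitions`.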

Translingual
What does it refer to exactly? Does translingual mean words which are used in more than one language? DaGizza 23:03, 10 January 2008 (UTC)
 * It refers to two main groups of things. 1) Symbols that don't really belong to any language at all (see %). 2) Taxonomic names (some people call them New Latin) that are used across all languages (that use the Roman script) (see Homininae). SemperBlotto 23:11, 10 January 2008 (UTC)

I also used it on CCC, which is an initialism for Chaos Computer Club and works in both English and German; not sure if that was right. Mutante 23:16, 10 January 2008 (UTC)

Language codes
Could we add language codes to this data? I'm going to do so manually right now but it ought to be added to the script that generates this page too. — hippietrail 05:07, 3 February 2008 (UTC)

PAGESINCATEGORY:
I've converted vi:Wiktionary:Thống kê to use the PAGESINCATEGORY magic word for the language breakdown. It'd be a bit more difficult to do that here; for instance, Category:English language doesn't directly contain all English words, so you'd have to add up all the parts of speech. In any event, it'd be a nice extension to the automatically-updated Special:Statistics page. – Minh Nguyễn (talk, contribs) 22:08, 21 May 2008 (UTC)

Statistics update
Is this supposed to be updated so rarely? The last dump is 50 days old. --Vahagn Petrosyan 20:49, 4 March 2009 (UTC)


 * I could be wrong, but I think the question is one of responsibility. Connel took care of this page for a long time, but he has been mostly absent as of late, and not doing the updates. Conrad did it a few times, and certainly has access to fresh dumps. I suggest you nag him. -Atelaes λάλει ἐμοί 20:55, 4 March 2009 (UTC)

please tell me...
What are "Form-of" definitions, and why has Mandarin only got 80 of them? Can someone please leave a message for me on my talk page about it? Cheers Tooironic 13:45, 21 November 2009 (UTC)
 * A "form of" definition consists of an entry that is defined solely as being a "form" of another word. For example, each English noun has a plural "form", and each English verb has a past, past participle, and present participle "form". A Latin verb may have over 100 "forms" (see the links in the inflection table at, for example). I suspect Mandarin doesn't have very many "form-of" entries because Mandarin verbs have only a single form, which is the main entry form. "Form-of" entries exist primarily in languages that conjugate their verbs or inflect their nouns and adjectives. --EncycloPetey 16:58, 21 November 2009 (UTC)

Gloss definitions
What is meant by "gloss definitions"? - -sche (discuss) 10:21, 8 February 2013 (UTC)
 * I think it's a definition that is not a "form-of" definition. Maro 18:46, 15 February 2013 (UTC)
 * See here: gloss. It'd be good to add this link to the table header: gloss

Fix grammar
"requests for definitions, this may divide things incorrectly"

This is a comma splice. Please change the comma to a semicolon or add "and" before "this." 2001:18E8:2:1020:1463:E53C:61CD:5659 15:37, 13 June 2013 (UTC)

English lemmata
In June of 2012, Ruakh counted how many English lemmata Wiktionary covered in three different ways. See [//en.wiktionary.org/w/index.php?title=Wiktionary:Requests_for_deletion/Others&curid=609185&diff=22000159&oldid=21971911#Wiktionary:Page_count here]. "Approach 1 gave 298,322; Approach 2 gave 299,516" and approach 3 (which lumped different parts of speech together, rather than considering them separate lemmata) gave 133,470. - -sche (discuss) 04:51, 30 August 2013 (UTC)

How does Latin have more entries and definitions than English?
How does a long-dead foreign language get more stuff here than the current, more widely used, actual language of this wiktionary?-47.20.162.183 00:20, 17 June 2014 (UTC)
 * Latin words have loads of inflected forms. — Ungoliant (falai) 00:21, 17 June 2014 (UTC)
 * Thanks! :)-47.20.162.183 01:30, 17 June 2014 (UTC)
 * I prefer using the gloss definitions column as a measure of how much content we have in a given language. The entries and definitions columns are heavily biased towards languages with complex inflection. Poor English, with its 4~5 inflected verb forms, stands no chance against Latin, which has over 100. — Ungoliant (falai) 01:36, 17 June 2014 (UTC)
 * Maybe the gloss definitions column should be first one or should be given prominence in some other way. --Vahag (talk) 08:23, 17 June 2014 (UTC)
 * I support that idea. If no one objects I’ll change the format for the next dump. — Ungoliant (falai) 13:10, 17 June 2014 (UTC)
 * No objection, but if "gloss definitions" is moved to come after "definitions", the latter should probably be renamed "total definitions" in the interest of clarity. Actually, as long as things are being changed around, could you also put a 1 or something after gloss definitions, so it can be linked to an explanation [//en.wiktionary.org/w/index.php?title=Wiktionary%3AStatistics&diff=27191494&oldid=27032083 like this]? Given that even I who edit this dictionary had to ask what the term meant, the number of passersby who know what it means is probably small enough to make it worth a footnote. - -sche (discuss) 15:23, 17 June 2014 (UTC)
 * While we’re at it, if there is any other layout change anyone wants to propose, speak up. I’m thinking of moving the data of appendix defs/entries to the same columns as the non-appendix data, since most languages have 0 anyway. — Ungoliant (falai) 15:43, 17 June 2014 (UTC)
 * Now that we have categories for every language called "Foo lemmas" and "Foo non-lemma forms", maybe the number of pages in each of those categories for each language could be added to the table. —Aɴɢʀ (talk) 20:35, 21 December 2014 (UTC)

Translation statistics
I’ll be keeping translation statistics at this page. — Ungoliant (falai) 15:54, 28 July 2015 (UTC)
 * I'm gonna bookmark that :) —Aryamanarora (मुझसे बात करो) 22:05, 8 December 2015 (UTC)
 * Good stats, thanks! Russian at #2, after Finnish (60,823 translations). Not bad at all! --Anatoli T. (обсудить/вклад) 23:13, 8 December 2015 (UTC)
 * Finnish is a surprise to me - and then there's Hindi, somewhere in the 40's. —Aryamanarora (मुझसे बात करो) 21:39, 3 January 2016 (UTC)

Statistics on Sindhi language
The information on Sindhi language is NOT correct even as of 2-12-2015. There were more than 1000 definitions in Sindhi wiktionary on that date. Please fix the error.

Aursani (talk) 09:57, 21 December 2015 (UTC)
 * This information is about English Wiktionary only. — Ungoliant (falai) 13:50, 21 December 2015 (UTC)

Statistics on lemmas and non-lemmas
I think it would be useful if the statistics included measures on how many lemma and non-lemma entries have been created or removed. Right now there is only a generic "entries" column, but that includes all entries, and I don't know if it distinguishes cases where a new lemma POS section has been added to a page that already has a section for the current language. That is what I would consider an "entry"; a single page can have multiple entries in one language. —CodeCat 21:35, 22 February 2016 (UTC)

Lemmas pie chart
Numbers from subcategories of Category:Lemmas by language, code copied from mw:Extension:Graph/Demo/CategoryPie:

{
  "version": 2,
  "width": 300,
  "height": 300,
  "data": [
    {
      // Data is retrieved from the MediaWiki API.
      "name": "table",
      "url": "wikiapi:///?action=query&prop=categoryinfo&titles=Category:English lemmas|Category:Italian lemmas|Category:Mandarin lemmas|Category:Finnish lemmas|Category:Chinese lemmas|Category:French lemmas|Category:Serbo-Croatian lemmas|Category:Portuguese lemmas|Category:Cantonese lemmas|Category:Japanese lemmas|Category:Spanish lemmas|Category:German lemmas|Category:Russian lemmas|Category:Latin lemmas|Category:Czech lemmas|Category:Dutch lemmas|Category:Korean lemmas|Category:Min Nan lemmas|Category:Hungarian lemmas|Category:Swedish lemmas|Category:Greek lemmas|Category:Polish lemmas|Category:Esperanto lemmas|Category:Georgian lemmas|Category:Irish lemmas|Category:Macedonian lemmas|Category:Catalan lemmas|Category:Vietnamese lemmas|Category:Hakka lemmas|Category:Armenian lemmas|Category:Telugu lemmas|Category:Icelandic lemmas|Category:Norwegian Bokmål lemmas|Category:Romanian lemmas|Category:Latvian lemmas|Category:Turkish lemmas|Category:Scottish Gaelic lemmas|Category:Norman lemmas|Category:Translingual lemmas|Category:Arabic lemmas|Category:Danish lemmas|Category:Persian lemmas|Category:Norwegian Nynorsk lemmas|Category:Old French lemmas|Category:Hebrew lemmas|Category:Ancient Greek lemmas|Category:Manx lemmas|Category:Old Armenian lemmas|Category:Hindi lemmas|Category:Albanian lemmas|Category:Ido lemmas|Category:Galician lemmas|Category:Luxembourgish lemmas|Category:Asturian lemmas|Category:Thai lemmas|Category:Old English lemmas|Category:Faroese lemmas|Category:Estonian lemmas|Category:Middle Chinese lemmas|Category:Old Chinese lemmas|Category:Adyghe lemmas|Category:Navajo lemmas|Category:Bulgarian lemmas|Category:Malay lemmas|Category:Proto-Germanic lemmas|Category:Malagasy lemmas|Category:Slovene lemmas|Category:Middle French lemmas|Category:Yiddish lemmas|Category:Slovak lemmas|Category:Swahili lemmas|Category:Norwegian lemmas|Category:Lithuanian lemmas|Category:Welsh lemmas|Category:Ukrainian lemmas|Category:Aromanian lemmas|Category:Volapük lemmas|Category:Classical Nahuatl lemmas|Category:Old Church Slavonic lemmas|Category:Venetian lemmas|Category:Sanskrit lemmas|Category:Classical Syriac lemmas|Category:Northern Sami lemmas|Category:Crimean Tatar lemmas|Category:Indonesian lemmas|Category:Urdu lemmas|Category:Lojban lemmas|Category:Afrikaans lemmas|Category:Bengali lemmas|Category:Old Saxon lemmas|Category:Belarusian lemmas|Category:Romansch lemmas|Category:Aramaic lemmas|Category:Kurdish lemmas|Category:Tagalog lemmas|Category:Middle English lemmas|Category:Khmer lemmas|Category:Wu lemmas|Category:Old Irish lemmas|Category:Basque lemmas|Category:Quechua lemmas|Category:Bashkir lemmas|Category:Friulian lemmas|Category:Occitan lemmas|Category:Maltese lemmas|Category:Old Norse lemmas|Category:Ladin lemmas|Category:Lower Sorbian lemmas|Category:Interlingua lemmas|Category:Ladino lemmas|Category:Hawaiian lemmas|Category:Mongolian lemmas|Category:Azeri lemmas|Category:Veps lemmas|Category:Vilamovian lemmas|Category:Cebuano lemmas|Category:Proto-Slavic lemmas|Category:Hiligaynon lemmas|Category:Breton lemmas|Category:Lao lemmas|Category:West Frisian lemmas|Category:Old High German lemmas|Category:Pashto lemmas|Category:Pali lemmas|Category:Dalmatian lemmas|Category:Sicilian lemmas|Category:Tajik lemmas|Category:Saterland Frisian lemmas|Category:Haitian Creole lemmas|Category:Mapudungun lemmas|Category:Greenlandic lemmas|Category:Tok Pisin lemmas|Category:Chuukese lemmas|Category:Tamil lemmas|Category:Burmese lemmas|Category:Kabardian lemmas|Category:Neapolitan lemmas|Category:Gothic lemmas|Category:Nepali lemmas|Category:Scots lemmas|Category:Proto-Samic lemmas|Category:Min Dong lemmas|Category:Egyptian lemmas|Category:Ottoman Turkish lemmas|Category:Somali lemmas|Category:Novial lemmas|Category:Low German lemmas|Category:Proto-Indo-European lemmas|Category:Old Portuguese lemmas|Category:Proto-Finnic lemmas&formatversion=2&format=json",
      // We are only interested in the content of query.pages subelement.
      "format": {"property": "query.pages","type": "json"},
      "transform": [
        // sort in descending order using category size as the sort key
        {"type": "sort","by": "-categoryinfo.size"},
        // To visualize, use "pie" transformation to add layout_start, layout_end, and layout_mid fields to each page object
        // These fields contain angles at which to start and stop drawing arcs. First element's start will be 0, and last element's end will be 360 degrees (in radians)
        {"type": "pie","field": "categoryinfo.size"}
      ]
    }
  ],
  // Scales are like functions -- marks use them to convert a data value into a visual value, like x or y coordinate on the graph, or a color value.
  "scales": [
    {
      // This scale will be used to assign a color to each slice, using a palette of 20 colors
      "name": "color",
      "domain": {"data": "table","field": "title"},
      "range": "category20",
      "type": "ordinal"
    }
  ],
  "marks": [
    {
      // This mark draws the actual pie chart from the data source
      // Each element is an arc between layout_start and layout_end angles (as calculated by the pie transformation)
      // drawn with a given radius, stroke, and fill.
      "from": {"data": "table"},
      "type": "arc",
      "properties": {
        "enter": {
          "fill": {"scale": "color","field": "title"},
          "outerRadius": {"value": 200},
          "startAngle": {"field": "layout_start"},
          "endAngle": {"field": "layout_end"},
          "stroke": {"value": "white"},
          "strokeWidth": {"value": 1}
        }
      }
    },
    {
      // This mark draws labels around the pie chart after the pie chart has been drawn
      // Before drawing, we need to perform a number of calculations to figure out the exact location and orientation of the text
      "from": {
        "data": "table",
        "transform": [
          // For each data point (datum), each of these transformations will be run in order.
          // Formula transformation evaluates the expression and assigns result to the datapoint
          // Size of the pie slice, in degrees: sliceSize = (end - start) * 180 / Pi
          { "type": "formula", "field": "sliceSize", "expr": "(datum.layout_end - datum.layout_start)*180/PI" },
          // Draw text only if the slice of the arc is more than 2 degrees to avoid overcrowding
          { "type": "filter", "test": "datum.sliceSize > 2" },
          // Remove namespace from the text - keeps only text after the first ':' symbol, limits to 40 chars.
          { "type": "formula", "field": "title", "expr": "substring(datum.title, 1+indexof(datum.title,':'), 40)" },
          // Determine the side of the pie chart we are on - left or right.
          { "type": "formula", "field": "invert", "expr": "datum.layout_mid*180/PI < 180 ? 1 : -1" },
          // If on the left, the text should be right-aligned (go from the rim inward)
          { "type": "formula", "field": "align", "expr": "datum.invert < 0 ? 'left' : 'right'" },
          // At what angle should the text be drawn relative to the point on the circle
          { "type": "formula", "field": "angle", "expr": "(datum.layout_mid*180/PI)-90*datum.invert" },
          // Make font smaller for smaller pie slices
          { "type": "formula", "field": "fontSize", "expr": "datum.sliceSize > 20 ? 15 : (datum.sliceSize > 10 ? 14 : 10)" },
          // Make font bold for largest pie slices
          { "type": "formula", "field": "fontWeight", "expr": "datum.sliceSize > 15 ? 'bold' : 'normal'" }
        ]
      },
      "type": "text",
      "properties": {
        "enter": {
          // Use the fields calculated in the transformation to draw category names
          "align": {"field": "align"},
          "angle": {"field": "angle"},
          "baseline": {"value": "middle"},
          "fill": {"value": "black"},
          "fontSize": {"field": "fontSize"},
          "fontWeight": {"field": "fontWeight"},
          "radius": {"value": 270},
          "text": {"field": "title"},
          "theta": {"field": "layout_mid"}
        }
      }
    }
  ]
}

The chart updates automatically. Would it make sense to add this to the page? --Yair rand (talk) 04:52, 24 February 2016 (UTC)


 * Why is it that, e.g., Spanish has 47,817 lemmas and German has 42,014, but Spanish doesn't show up on the chart? DTLHS (talk) 04:57, 24 February 2016 (UTC)
 * Hm. Might be an API limitation. It seems to be ignoring all languages past the first 500 in the list. I'll go ask the author of the chart template if there's any way to fix it. --Yair rand (talk) 05:07, 24 February 2016 (UTC)
 * Apparently it can't find more than 500 subcategories at a time, and it can't automatically just get the largest categories. I've changed it to a manual list of the largest 150. Unfortunately, this won't automatically add in new languages that enter the top 150. --Yair rand (talk) (not logged in) 14:34, 24 February 2016 (UTC)
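
The manual list discussed above could be regenerated offline along the following lines. This is only a sketch under stated assumptions: the helper names (`category_titles`, `build_query_url`, `largest_categories`) are hypothetical, the `wikiapi:///` URL shape is copied from the chart definition in this thread, and the response shape (a list of pages, each with an optional `categoryinfo.size`) is assumed from the fields that the chart definition reads. It makes no network request.

```python
from urllib.parse import quote


def category_titles(languages):
    """Turn language names into 'Category:<name> lemmas' titles."""
    return [f"Category:{name} lemmas" for name in languages]


def build_query_url(languages):
    """Build a categoryinfo query URL in the shape used by the chart."""
    titles = "|".join(category_titles(languages))
    return (
        "wikiapi:///?action=query&prop=categoryinfo"
        # Keep '|' and ':' readable; spaces and non-ASCII get percent-encoded.
        f"&titles={quote(titles, safe='|:')}"
        "&formatversion=2&format=json"
    )


def largest_categories(pages, n):
    """Return the n pages with the largest category size, largest first.

    Pages without a 'categoryinfo' entry (assumed here to mean the
    category does not exist) are skipped.
    """
    sized = [p for p in pages if "categoryinfo" in p]
    return sorted(sized, key=lambda p: p["categoryinfo"]["size"], reverse=True)[:n]


# Sample data: the two sizes are the figures quoted in this thread.
sample_pages = [
    {"title": "Category:German lemmas", "categoryinfo": {"size": 42014}},
    {"title": "Category:Spanish lemmas", "categoryinfo": {"size": 47817}},
    {"title": "Category:Nonexistent lemmas"},
]
url = build_query_url(["English", "Min Nan"])
top = largest_categories(sample_pages, 1)
```

Sorting locally and keeping the top `n` titles gives a list that can be pasted into the chart, though it would still need to be rerun by hand whenever the ranking changes.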

How do these column headers correspond to "etymologies"?
gloss definitions, entries, gloss entries, form definitions, total definitions - which of them is "etymologies"? --Qdinar (talk) 12:58, 6 February 2020 (UTC)

Pageview stats
I added some links to Wikimedia's pageview stats in Special:Diff/51268749/51268782, but it looks like they got removed (by a script?) in Special:Diff/57992037/58655295. – Jberkel 18:10, 17 February 2020 (UTC)
 * That was my fault. I accidentally edited WT:Statistics instead of WT:Statistics/generated when I added this month’s stats. — Ungoliant (falai) 00:52, 18 February 2020 (UTC)

Amharic Wiktionary counter
Just on the first page of the list of words starting with "a" there are 345 words (look here). But the counter for Amharic still says 384 content! What is this madness! Abreham97 (talk) 00:34, 16 November 2021 (UTC)

Update?
How can we update the stats on Statistics/generated (currently from the 2022-01-01 dump)? A455bcd9 (talk) 12:44, 13 March 2022 (UTC)


 * Same issue for May :) A455bcd9 (talk) 07:22, 29 May 2022 (UTC)
 * poke @Ungoliant MMDCCLXIV. Would be amazing to have the code on GitHub or GitLab so that anyone can generate and update this page. A455bcd9 (talk) 07:23, 29 May 2022 (UTC)
 * @Ungoliant MMDCCLXIV Hi, I hope all is well. Could you please update the statistics or create a document explaining how to generate them so that anyone can run them in your absence? Thanks for any help you can provide. A455bcd9 (talk) 07:34, 14 July 2022 (UTC)

I must confess that I, too, am becoming slightly impatient. On the other hand, Ungoliant may just have quit, and we can force no user to stay active and keep things up to date. Maybe raise the issue centrally? Steinbach (talk) 14:36, 17 October 2022 (UTC)
 * I'm working on a replacement for Ungoliant's stats, the code will be hosted on gitlab/toolforge, to avoid this situation. However, it's not quite ready yet. – Jberkel 15:15, 17 October 2022 (UTC)
 * Thanks for your help @Jberkel. FYI the French Wiktionary has detailed statistics and they would be happy to help. A455bcd9 (talk) 13:54, 1 November 2022 (UTC)
 * How is it going? Steinbach (talk) 18:03, 3 January 2023 (UTC)
 * First iteration is now done. Jberkel 04:23, 11 March 2023 (UTC)
 * Thanks! A455bcd9 (talk) 10:50, 11 March 2023 (UTC)
 * Thanks indeed! Btw, what explains the apparent drop in the number of languages? Steinbach (talk) 14:56, 11 March 2023 (UTC) Oh, and can you provide the gitlab link? Steinbach (talk) 14:58, 11 March 2023 (UTC)
 * The repo: gitlab. It contains a lot more code than just the stats. The drop in languages is probably because reconstruction and appendix namespaces are not included. This is a limitation of the HTML dumps, see Statistics. Jberkel 21:57, 11 March 2023 (UTC)
 * Thank you. I hope someone (either you or someone else) can fix that. Appendix-only languages were already hard to find, this makes them even less visible. Steinbach (talk) 11:25, 12 March 2023 (UTC)
 * I have another question for you. Is it right that there are no new languages? After sorting the table for "change in number of gloss definitions" I noticed that several languages had gone up from very few (often enough one or two) to a decent number, but none were entirely new (that is, change in number of gloss definitions equals number of gloss definitions). How does your script handle any new language headers? Steinbach (talk) 12:26, 16 March 2023 (UTC)
 * I didn't include a diff of new language headers in the output, but there were probably new headers. The stats generation tool reads all L2 headers and transforms them into language codes based on the data of Module:languages. Languages not listed in this module are ignored (usually typos or errors). For the next run I can include a diff. Jberkel 07:47, 24 March 2023 (UTC)
 * Regarding the missing Appendix/Reconstruction languages, it sometimes helps to signal your interest on phabricator (subscribing to the task etc) in order to get things moving a bit faster there. Jberkel 21:52, 25 March 2023 (UTC)
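
The header handling described above (read every L2 header, map it to a language code, ignore names that are not listed) can be sketched roughly as follows. This is an illustration, not the actual tool: the real generator is said to work from HTML dumps and Module:languages, whereas this sketch parses raw wikitext for simplicity, and `LANGUAGE_CODES`, `L2_HEADER`, and `language_codes` are hypothetical names with a tiny hand-written mapping.

```python
import re

# Hypothetical excerpt of a name-to-code mapping; the real data is said to
# come from Module:languages.
LANGUAGE_CODES = {"English": "en", "Spanish": "es", "Latin": "la"}

# Matches level-2 headings such as "==English==" but not deeper ones such as
# "===Noun===", because the first character after "==" must not be "=".
L2_HEADER = re.compile(r"^==([^=].*?)==\s*$", re.MULTILINE)


def language_codes(wikitext, codes=LANGUAGE_CODES):
    """Return the language codes for the L2 headers found in a page.

    Headers whose name is not in the mapping (typically typos or errors)
    are ignored.
    """
    found = []
    for match in L2_HEADER.finditer(wikitext):
        name = match.group(1).strip()
        code = codes.get(name)
        if code is not None:
            found.append(code)
    return found


page = (
    "==English==\n===Noun===\n# a thing\n\n"
    "==Spansh==\n===Noun===\n# a misspelled header\n\n"
    "==Latin==\n===Verb===\n# another thing\n"
)
result = language_codes(page)
```

In this sample the misspelled "Spansh" header is dropped, so only the English and Latin sections are counted. A diff of new headers between two runs, as proposed above, would amount to comparing the sets of codes seen in each dump.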