Appendix talk:Mandarin Frequency lists/1-1000

Remove many obvious duplicates. Must be machine generated. Makaimariano (talk) 08:23, 18 December 2018 (UTC)

The list contains only 908 words, as a simple copy-paste into Excel shows.
 * You can search the real AS Corpus here: https://elearning.ling.sinica.edu.tw/eng_teaching.html
 * If you look at the first 300 words, it's clear there are some errors in the Wikimedia version; for example, the first two rows are missing. Gaps like this probably explain the other 90-some missing entries. It would be great to see this realigned to the actual frequency list, but that would be a lot of work (and would undo much of the cleanup that has already been done). --User6985 (talk) 23:10, 17 April 2021 (UTC)

What is "det." short for? Please add a footnote for "det."

Several words appear to be listed multiple times (one right after another), using, as far as I can tell, the exact same form and pronunciation. What's up with that? 219.142.235.124 03:14, 17 September 2013 (UTC)
 * The format of the page is not perfect. The first form is in Traditional Chinese, the second in Simplified Chinese. They often coincide. --Anatoli (обсудить/вклад) 03:37, 17 September 2013 (UTC)
 * I think he was referring to lines like 758 and 759. Jamesjiao → T ◊ C 04:24, 17 September 2013 (UTC)


 * Thanks, James. Will have to fix the duplicates and rearrange the lists. (Will take a bit of time.) --Anatoli (обсудить/вклад) 04:29, 17 September 2013 (UTC)

Change to table format
I think this would be more user friendly if formatted in a table like this:

This is much easier to read. Also, table formatting is preserved when pasting into spreadsheet programs, which is very useful for students who intend to study these words.
 * So I'm lost. Is the first column Simplified or Traditional?

Regex to convert to tabbed lines
You can convert the list to a table by first using a regular expression to split each line into its pieces and rewrite it as tab-separated fields, then either turning that output into a table or importing it directly into Excel, etc.

Find:

^(.+?)\,\s+?(.+?)\s+?(\(.+?\))\s+?(.+)$

Replace:

$1\t$2\t$3\t$4
 * This almost worked as is in gedit. Had to use \1 instead of $1 there but anyway, updated the article with a nice table. --Giszmo2 (talk) 22:20, 31 May 2019 (UTC)
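For anyone who prefers a script to an editor's find-and-replace, here is a minimal Python sketch of the same conversion. It uses the pattern given above; the function name `to_tsv` and the two sample lines are made up for illustration and are not copied from the list. Python's `re` module, like gedit, writes backreferences as `\1` rather than `$1`.

```python
import re

# The pattern from the discussion above, compiled so that ^ and $ match
# at the start and end of every line (re.MULTILINE).
LINE_RE = re.compile(r"^(.+?)\,\s+?(.+?)\s+?(\(.+?\))\s+?(.+)$", re.MULTILINE)


def to_tsv(text: str) -> str:
    """Rewrite each matching line as four tab-separated fields."""
    # \1..\4 are the captured groups; \t in the replacement becomes a tab.
    return LINE_RE.sub(r"\1\t\2\t\3\t\4", text)


# Hypothetical sample lines, assumed to follow the layout
# "form, form (pronunciation) rest of line".
sample = "這個, 这个 (zhège) this\n我們, 我们 (wǒmen) we"
print(to_tsv(sample))
```

Lines that do not fit the assumed layout are left unchanged, so it is worth checking the output for rows that still contain no tabs before importing it into a spreadsheet.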

Duplicates
Hi! I found the following duplicates. The numbers correspond to the current order; some entries were removed earlier, so don't expect 9032 to be the 32nd entry in the 9001-10000 section. I could try making one massive edit to remove these, but I doubt Wikipedia is going to allow it. 2803:9800:9504:7B33:4B6C:CC1A:D8EF:BD67 19:41, 1 August 2023 (UTC)


 * Moreover, I found that there are a lot of non-consecutive repeats, for example:
 * I found through a script that there are 747 other repeats not in consecutive order. This raises the question of how to deal with words like that: which position should take precedence, the lower or the higher frequency? My opinion is that the list should be reworked from scratch to remove all these inconsistencies. I could probably work out a little script to deal with this, if someone is willing to push the changes (I think that if I tried making too radical a change right now, it would probably get rejected automatically). 2803:9800:9504:7B33:4B6C:CC1A:D8EF:BD67 19:59, 1 August 2023 (UTC)
 * I looked at the original version of the 1001-2000 list, and it had lots of duplicates then, so the data was bad at the beginning. I would just keep the first listing of a word and leave words on the same page. — Eru·tuon 20:09, 1 August 2023 (UTC)
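The "keep the first listing" approach suggested above could be sketched in Python roughly as follows. This is not the script mentioned earlier in the thread, which was not posted; the function names and the sample words are hypothetical, and the input is assumed to be the headwords already extracted in list order.

```python
from collections import defaultdict


def find_repeats(words):
    """Map each repeated word to the 1-based positions where it occurs."""
    positions = defaultdict(list)
    for index, word in enumerate(words, start=1):
        positions[word].append(index)
    return {word: where for word, where in positions.items() if len(where) > 1}


def keep_first(words):
    """Drop every later occurrence of a word, preserving the original order."""
    seen = set()
    result = []
    for word in words:
        if word not in seen:
            seen.add(word)
            result.append(word)
    return result


# Hypothetical sample: one consecutive repeat and one non-consecutive repeat.
sample = ["的", "是", "是", "不", "我", "的"]
print(find_repeats(sample))
print(keep_first(sample))
```

Keeping the first occurrence means the higher-frequency position wins. This sketch deduplicates a single list; leaving words on the same page, as suggested above, would need the page boundaries tracked separately.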