Wiktionary talk:Frequency lists/Vietnamese syllables

Ay != Ai
"Ai" and "Ay" represent different sounds in Vietnamese. For example, "bay" and "bai" are not homophones, but you've mistakenly tabulated them as the same syllable.

I understand you've tried to normalize spellings using I and Y, which are to some degree interchangeable in Vietnamese, but they're only interchangeable when they're the sole vowel in the word. They aren't interchangeable in diphthongs or triphthongs.
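To make the rule concrete, here is a minimal sketch of a normalizer that merges I and Y only when the letter is the sole vowel of the syllable. The function names are hypothetical, and folding Y into I (rather than the other way around) is an arbitrary choice for illustration, not something the list's author is known to have done. The sketch is deliberately conservative: a spelling such as "quy" counts as two vowel letters and is left alone.

```python
import unicodedata

# Base vowel letters once diacritics are stripped (assumption: enough for counting).
VOWELS = set("aeiouy")


def base_letters(syllable):
    """Lowercase the syllable and strip tone and vowel-quality marks."""
    decomposed = unicodedata.normalize("NFD", syllable.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_i_y(syllable):
    """Fold y into i only when it is the sole vowel letter of the syllable."""
    vowel_letters = [ch for ch in base_letters(syllable) if ch in VOWELS]
    decomposed = unicodedata.normalize("NFD", syllable.lower())
    if vowel_letters == ["y"]:
        # The tone mark follows the base letter in NFD, so it carries over to "i".
        decomposed = decomposed.replace("y", "i")
    return unicodedata.normalize("NFC", decomposed)
```

With this rule, "mỹ" and "mĩ" collapse to one entry, while "bay" and "bai" stay separate.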

216.94.43.4 18:55, 4 July 2018 (UTC)


 * Yes, seconding this in 2022. Gavinkwhite (talk) 04:21, 23 May 2022 (UTC)

Methodology
Scanning the Vietnamese Wikipedia for syllables runs into some pretty significant problems:


 * A list of syllables is of limited utility for lexicography, because Vietnamese isn't a monosyllabic language, especially not in formal encyclopedic writing.
 * Links were broken apart, which missed many opportunities to identify compound words in proper nouns.
 * This particular execution evidently didn't exclude any templates (infoboxes, article issue warnings, categories), so the list is full of the contents of common templates like w:vi:Bản mẫu:Sơ khai (stub), w:vi:Bản mẫu:Wiki hóa (wikify), and w:vi:Bản mẫu:Đang diễn ra (current event).
 * It also didn't exclude any common headings expected in every article, such as "Xem thêm" (see also), "Tham khảo" (references), and "Liên kết ngoài" (external links). Other formulaic headings are apparent, such as "Dân số" (demographics, found in geographical articles).
 * Some non-Vietnamese words and names are included, such as "Lee" and "team".
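As a rough illustration of how a scan could sidestep several of the problems above, here is a minimal sketch that strips templates, drops formulaic headings, skips category links, and counts each remaining link as one unbroken term. This is not how the existing list was produced. The regular expressions, the function names, and the short heading set are assumptions for illustration, and a few regexes are no substitute for a real wikitext parser.

```python
import re
import unicodedata
from collections import Counter

# Hypothetical, incomplete set of headings expected in almost every article.
BOILERPLATE_HEADINGS = {"xem thêm", "tham khảo", "liên kết ngoài"}

TEMPLATE = re.compile(r"\{\{[^{}]*\}\}")
HEADING = re.compile(r"^=+[ \t]*(.*?)[ \t]*=+[ \t]*$", re.MULTILINE)
LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")


def strip_templates(text):
    """Remove {{...}} templates, innermost first, until none remain."""
    previous = None
    while previous != text:
        previous = text
        text = TEMPLATE.sub("", text)
    return text


def drop_boilerplate_headings(text):
    """Delete formulaic headings; keep the title text of all other headings."""
    def replace(match):
        title = match.group(1)
        return "" if title.lower() in BOILERPLATE_HEADINGS else title
    return HEADING.sub(replace, text)


def count_terms(wikitext):
    """Count link texts as whole terms and the remaining prose by syllable."""
    text = unicodedata.normalize("NFC", wikitext)
    text = drop_boilerplate_headings(strip_templates(text))
    counts = Counter()

    def take_link(match):
        target, label = match.group(1), match.group(2)
        if not target.lower().startswith("thể loại:"):  # skip category links
            counts[(label or target).strip().lower()] += 1
        return " "

    text = LINK.sub(take_link, text)
    counts.update(re.findall(r"[^\W\d_]+", text.lower()))
    return counts
```

Keeping the link text together means a link such as `[[Hà Nội]]` is tallied as "hà nội" rather than as two unrelated syllables.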

This manually curated list of 78 stop words is more reliable as a starting point. This list is used by the KDE knowledge base's search engine, the Vietnamese Wiktionary (to avoid automatically linking these words), and ORES (for revision scoring). Although it's unordered and a very short list compared to what would be expected of a frequency list, I would expect any half-decent frequency list to include almost all these words toward the top, such as and, which don't even appear in the syllable list.

– Minh Nguyễn 💬 20:28, 18 July 2022 (UTC)


 * For another point of comparison, a more rigorous frequency list was generated in . This article is unfortunately copyrighted, but copyright wouldn't prevent us from gut-checking a list obtained through other means. Minh Nguyễn 💬 20:32, 18 July 2022 (UTC)