Wiktionary talk:Unicode high characters

43 pages employing Unicode high characters in their title
This is in response to a message by Brion Vibber on the Wikitech mailing list. MySQL 5 is scheduled to come out of beta next month, and we're going to be looking at upgrading sometime in the coming months. Among other things we're probably going to want to start making use of the support for Unicode collation, so we can get better sorting and perhaps use it for case-insensitive matching.

There is however a compatibility issue: MySQL's Unicode support is limited to the 16-bit character range (basic multilingual plane), both for ucs2 and utf8 storage modes.

Characters beyond the BMP are relatively rare, but they do occur. Mostly they are ancient/dead scripts, some invented scripts, and a bunch of rare Han characters which sometimes turn up in Chinese and Japanese.
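To make the BMP limit concrete, here is a minimal Python sketch of how a "high character" can be recognised: any code point above U+FFFF. The function names `is_beyond_bmp` and `has_high_chars` are made up for illustration and are not part of MediaWiki or MySQL.

```python
def is_beyond_bmp(ch):
    """True if the character lies outside the Basic Multilingual Plane."""
    return ord(ch) > 0xFFFF


def has_high_chars(text):
    """True if any character in the text is beyond U+FFFF."""
    return any(is_beyond_bmp(ch) for ch in text)


# U+10330 GOTHIC LETTER AHSA, written as an escape so the source stays ASCII.
gothic_ahsa = "\U00010330"

print(has_high_chars("wulfs"))                    # False
print(has_high_chars(gothic_ahsa))                # True
print(len(gothic_ahsa.encode("utf-8")))           # 4 bytes in UTF-8
print(len(gothic_ahsa.encode("utf-16-be")) // 2)  # 2 code units (a surrogate pair)
```

A storage mode that only covers the 16-bit range cannot hold the four-byte UTF-8 form or treat the surrogate pair as a single character, which is why such titles are affected.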

This won't affect page _contents_; our content is stored in binary blobs and can have any wacky characters we want. But to support these high characters in page titles, usernames, and such might require jumping through a lot of hoops.

It would be relatively simple to disable use of titles and usernames with these high characters; to assess possible impact I did a check through all our current wikis and found 99 extant pages:

 * 43 in en.wiktionary.org
 * 31 in got.wikipedia.org
 * 10 in la.wiktionary.org
 * 9 in zh.wikipedia.org
 * 3 in so.wikipedia.org
 * 1 in en.wikibooks.org
 * 1 in ja.wikipedia.org
 * 1 in nl.wikibooks.org
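A per-wiki tally like the one above could be produced roughly as follows. This is only a sketch, not the check that was actually run: the function `count_high_titles` and the `(wiki, title)` input format are assumptions, and the sample rows are invented for illustration.

```python
from collections import Counter


def count_high_titles(titles):
    """Count titles containing characters beyond U+FFFF, per wiki.

    `titles` is an iterable of (wiki, title) pairs (a hypothetical format).
    """
    counts = Counter()
    for wiki, title in titles:
        if any(ord(ch) > 0xFFFF for ch in title):
            counts[wiki] += 1
    return counts


# Invented sample rows; the real data would come from each wiki's page table.
sample = [
    ("en.wiktionary.org", "\U00010330"),  # a single Gothic letter
    ("en.wiktionary.org", "hunds"),       # plain Latin title, not counted
    ("got.wikipedia.org", "\U00010345\U0001033F\U0001033B\U00010346\U00010343"),
]

for wiki, n in count_high_titles(sample).most_common():
    print(n, "in", wiki)
```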

I've put the full list of pages here: http://meta.wikimedia.org/wiki/User:Brion_VIBBER/Unicode_high_chars

Most of the en.wiktionary entries are individual letters in the Deseret and Shavian alphabets (invented alphabets for English; historical curiosities).

The Gothic alphabet is entirely in the high-character area, but it's a long-dead language and not exactly an active wiki. Perhaps we should just close it down...

Latin Wiktionary contains several Gothic terms...

Most of what we have is by User:Vladisdead who hasn't been around for a while. Unless you are equipped for UTF-16, you probably won't see the characters. It might be a few years before we can handle them properly; in the meantime I see no reason for keeping them. Eclecticology 05:08, 15 October 2005 (UTC)


 * Um, User:Vladisdead is the same person as User:Vlad and User:Ptcamn who posted on this very page just a week ago and shows up in the IRC channel occasionally. He is still very active on la.wiktionary.  I've found (through my own design of a Gothic font) that the browsers out now are entirely capable of handling the plane 1 characters, though fonts themselves may not be easy to come by. At any rate, it also includes things like Linear B (for Mycenaean), and Old Italic (for Etruscan etc.).  I should say I don't like this suggestion of "all words in all languages, unless Unicode decided to assign their writing system codes higher than the arbitrarily set value of U+FFFF".  At the very least move the damn things to transliterated page titles instead of deleting them, for goodness' sake.  —Muke Tever 08:39, 15 October 2005 (UTC)


 * I see no problem with accommodating the technical limitations of article titles; for these few entries, the "proper" Unicode (that only two people have the technology capable of viewing) can certainly be included in the main body of the article. We do something similar with that 1023 character long chemistry "word" too, don't we?  -- Connel MacKenzie 15:01, 15 October 2005 (UTC)


 * Methionylglutaminyl...serine was the particular example I was thinking of. -- Connel MacKenzie 15:19, 15 October 2005 (UTC)


 * Doubtful that it's "only two". At any rate, as these are international standards that are gaining support we can expect the number of people who can read them to increase: the principle being what they call on e2 "noding for the ages", i.e. this information's value will increase over time, not decrease...
 * At any rate, I don't see any problem with accommodating for technical limitations either. But this isn't on Beer Parlour, or Requests for Cleanup, it's on Requests for Deletion, which is ridiculous.  —Muke Tever 17:54, 15 October 2005 (UTC)


 * Transliterated titles would be fine, but that doesn't help us with the articles that are for the Shavian and Deseret letters themselves. Do we even need articles for these letters when users won't know how to look them up anyway?  Will these articles give any more information than can be included in comprehensive Wikipedia articles about the alphabets?


 * It is about deleting pages, so it's not misplaced. If you feel a few can be fixed, do it.  The suggestion that more people will be able to read these is unfounded speculation.  Where are the big movements to promote the Shavian and Deseret alphabets? Eclecticology 18:24, 15 October 2005 (UTC)


 * Even after pages are moved to Romanized (UTF-16) pages, the redirects that remain on http://meta.wikimedia.org/wiki/User:Brion_VIBBER/Unicode_high_chars will still need to be deleted. Muke, (or anyone else who knows precisely how to clean these up) could you please list the ones you've corrected as you correct them?  -- Connel MacKenzie 06:33, 16 October 2005 (UTC)


 * There isn't much Gothic on en. 𐌼𐌰𐌸𐌰 moved to maþa, 𐌷𐌿𐌽𐌳𐍃 to hunds, 𐍅𐌿𐌻𐍆𐍃 to wulfs.  But 𐌰 I can't move (a already exists), and  𐍈 I can't move either as ƕ, too, exists.  That's the mass of it.  Also moved Mycenaean 𐀀𐀵𐀫𐀦 to a-to-ro-qo.  Feh... The rest appear to be individual characters, and will most likely require merging with whatever articles will already exist at their standard transliteration equivalent.  —Muke Tever 18:36, 16 October 2005 (UTC)


 * Spoke prematurely -- I couldn't move 𐌼𐌰𐌸𐌰 either, with maþa being an Old English word as well.
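The move-or-merge decision described in the moves above can be sketched in Python. The letter mapping below is inferred only from the titles mentioned in this thread (hunds, wulfs, maþa, a, ƕ) and is not a complete Gothic transliteration table; `romanize`, `plan_move`, and the set of existing titles are hypothetical names and data.

```python
# Gothic letters (plane 1) mapped to the Latin letters used in the moves
# listed in this thread. Escapes are used so the source stays ASCII.
GOTHIC_TO_LATIN = {
    "\U00010330": "a",
    "\U00010333": "d",
    "\U00010337": "h",
    "\U00010338": "\u00fe",  # thorn
    "\U0001033B": "l",
    "\U0001033C": "m",
    "\U0001033D": "n",
    "\U0001033F": "u",
    "\U00010343": "s",
    "\U00010345": "w",
    "\U00010346": "f",
    "\U00010348": "\u0195",  # hv ligature
}


def romanize(title):
    """Replace each known Gothic letter; leave anything else unchanged."""
    return "".join(GOTHIC_TO_LATIN.get(ch, ch) for ch in title)


def plan_move(title, existing_titles):
    """Return ("move", target) if the target is free, else ("merge", target)."""
    target = romanize(title)
    if target in existing_titles:
        return ("merge", target)
    return ("move", target)


# Hypothetical set of titles that already exist on the wiki.
existing = {"a", "\u0195", "ma\u00fea"}

hunds = "\U00010337\U0001033F\U0001033D\U00010333\U00010343"
print(plan_move(hunds, existing))         # ('move', 'hunds')
print(plan_move("\U00010330", existing))  # ('merge', 'a')
```

When the transliterated title is already taken, as with a, ƕ and maþa, a plain move is not possible and the content has to be merged into the existing entry.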


 * I didn't say it was misplaced—I know you were talking about deleting the pages. (I was, however, saying it was needlessly antagonistic to suggest deleting the articles instead of downgrading them to the level of this planned regression.) And it is not "unfounded speculation" to say that Unicode support is increasing in the computing world: you used to be lucky to get more than 255 characters at a time, then the standard Windows fonts increased to include the whole of the WGL4, and nowadays you also get several fonts with higher alphabet ranges even than that out of the box.  And I very much doubt that it is because of "big movements" to have e.g. the Thai alphabet supported in U.S. versions of Windows.  People expect their computers to just work, and as time goes on that is what they will do.  (As mentioned, all the major browsers already handle these characters; when they don't display it's because the system doesn't have any images to give them.)  As for "not knowing how to look them up", that's empty: the average user knows exactly as much about searching for Deseret as they do for Japanese.  When running across any string in an alphabet they don't know how to type, the search method is copy and paste.   —Muke Tever 18:16, 16 October 2005 (UTC)