Wiktionary talk:Votes/2011-07/Redirecting single-character digraphs

Affected characters
I wrote a Perl script to go through UnicodeData.txt and NamesList.txt (from http://www.unicode.org/Public/UNIDATA/) and find each character that has a direct compatibility mapping to a sequence of multiple non-modifier letters, without any compatibility formatting tag. That turns out to include these 56 characters:


 * U+0132 &#x0132; LATIN CAPITAL LIGATURE IJ   &rarr;    U+0049 &#x0049;     U+004A &#x004A;
 * U+0133 &#x0133; LATIN SMALL LIGATURE IJ   &rarr;    U+0069 &#x0069;     U+006A &#x006A;
 * U+01C4 &#x01C4; LATIN CAPITAL LETTER DZ WITH CARON   &rarr;    U+0044 &#x0044;     U+017D &#x017D;
 * U+01C5 &#x01C5; LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON   &rarr;    U+0044 &#x0044;     U+017E &#x017E;
 * U+01C6 &#x01C6; LATIN SMALL LETTER DZ WITH CARON   &rarr;    U+0064 &#x0064;     U+017E &#x017E;
 * U+01C7 &#x01C7; LATIN CAPITAL LETTER LJ   &rarr;    U+004C &#x004C;     U+004A &#x004A;
 * U+01C8 &#x01C8; LATIN CAPITAL LETTER L WITH SMALL LETTER J   &rarr;    U+004C &#x004C;     U+006A &#x006A;
 * U+01C9 &#x01C9; LATIN SMALL LETTER LJ   &rarr;    U+006C &#x006C;     U+006A &#x006A;
 * U+01CA &#x01CA; LATIN CAPITAL LETTER NJ   &rarr;    U+004E &#x004E;     U+004A &#x004A;
 * U+01CB &#x01CB; LATIN CAPITAL LETTER N WITH SMALL LETTER J   &rarr;    U+004E &#x004E;     U+006A &#x006A;
 * U+01CC &#x01CC; LATIN SMALL LETTER NJ   &rarr;    U+006E &#x006E;     U+006A &#x006A;
 * U+01F1 &#x01F1; LATIN CAPITAL LETTER DZ   &rarr;    U+0044 &#x0044;     U+005A &#x005A;
 * U+01F2 &#x01F2; LATIN CAPITAL LETTER D WITH SMALL LETTER Z   &rarr;    U+0044 &#x0044;     U+007A &#x007A;
 * U+01F3 &#x01F3; LATIN SMALL LETTER DZ   &rarr;    U+0064 &#x0064;     U+007A &#x007A;
 * U+0587 &#x0587; ARMENIAN SMALL LIGATURE ECH YIWN   &rarr;    U+0565 &#x0565;     U+0582 &#x0582;
 * U+0675 &#x0675; ARABIC LETTER HIGH HAMZA ALEF   &rarr;    U+0627 &#x0627;     U+0674 &#x0674;
 * U+0676 &#x0676; ARABIC LETTER HIGH HAMZA WAW   &rarr;    U+0648 &#x0648;     U+0674 &#x0674;
 * U+0677 &#x0677; ARABIC LETTER U WITH HAMZA ABOVE   &rarr;    U+06C7 &#x06C7;     U+0674 &#x0674;
 * U+0678 &#x0678; ARABIC LETTER HIGH HAMZA YEH   &rarr;    U+064A &#x064A;     U+0674 &#x0674;
 * U+0EDC &#x0EDC; LAO HO NO   &rarr;    U+0EAB &#x0EAB;     U+0E99 &#x0E99;
 * U+0EDD &#x0EDD; LAO HO MO   &rarr;    U+0EAB &#x0EAB;     U+0EA1 &#x0EA1;
 * U+20A8 &#x20A8; RUPEE SIGN   &rarr;    U+0052 &#x0052;     U+0073 &#x0073;
 * U+2116 &#x2116; NUMERO SIGN   &rarr;    U+004E &#x004E;     U+006F &#x006F;
 * U+2121 &#x2121; TELEPHONE SIGN   &rarr;    U+0054 &#x0054;     U+0045 &#x0045;     U+004C &#x004C;
 * U+213B &#x213B; FACSIMILE SIGN   &rarr;    U+0046 &#x0046;     U+0041 &#x0041;     U+0058 &#x0058;
 * U+2161 &#x2161; ROMAN NUMERAL TWO   &rarr;    U+0049 &#x0049;     U+0049 &#x0049;
 * U+2162 &#x2162; ROMAN NUMERAL THREE   &rarr;    U+0049 &#x0049;     U+0049 &#x0049;     U+0049 &#x0049;
 * U+2163 &#x2163; ROMAN NUMERAL FOUR   &rarr;    U+0049 &#x0049;     U+0056 &#x0056;
 * U+2165 &#x2165; ROMAN NUMERAL SIX   &rarr;    U+0056 &#x0056;     U+0049 &#x0049;
 * U+2166 &#x2166; ROMAN NUMERAL SEVEN   &rarr;    U+0056 &#x0056;     U+0049 &#x0049;     U+0049 &#x0049;
 * U+2167 &#x2167; ROMAN NUMERAL EIGHT   &rarr;    U+0056 &#x0056;     U+0049 &#x0049;     U+0049 &#x0049;     U+0049 &#x0049;
 * U+2168 &#x2168; ROMAN NUMERAL NINE   &rarr;    U+0049 &#x0049;     U+0058 &#x0058;
 * U+216A &#x216A; ROMAN NUMERAL ELEVEN   &rarr;    U+0058 &#x0058;     U+0049 &#x0049;
 * U+216B &#x216B; ROMAN NUMERAL TWELVE   &rarr;    U+0058 &#x0058;     U+0049 &#x0049;     U+0049 &#x0049;
 * U+2171 &#x2171; SMALL ROMAN NUMERAL TWO   &rarr;    U+0069 &#x0069;     U+0069 &#x0069;
 * U+2172 &#x2172; SMALL ROMAN NUMERAL THREE   &rarr;    U+0069 &#x0069;     U+0069 &#x0069;     U+0069 &#x0069;
 * U+2173 &#x2173; SMALL ROMAN NUMERAL FOUR   &rarr;    U+0069 &#x0069;     U+0076 &#x0076;
 * U+2175 &#x2175; SMALL ROMAN NUMERAL SIX   &rarr;    U+0076 &#x0076;     U+0069 &#x0069;
 * U+2176 &#x2176; SMALL ROMAN NUMERAL SEVEN   &rarr;    U+0076 &#x0076;     U+0069 &#x0069;     U+0069 &#x0069;
 * U+2177 &#x2177; SMALL ROMAN NUMERAL EIGHT   &rarr;    U+0076 &#x0076;     U+0069 &#x0069;     U+0069 &#x0069;     U+0069 &#x0069;
 * U+2178 &#x2178; SMALL ROMAN NUMERAL NINE   &rarr;    U+0069 &#x0069;     U+0078 &#x0078;
 * U+217A &#x217A; SMALL ROMAN NUMERAL ELEVEN   &rarr;    U+0078 &#x0078;     U+0069 &#x0069;
 * U+217B &#x217B; SMALL ROMAN NUMERAL TWELVE   &rarr;    U+0078 &#x0078;     U+0069 &#x0069;     U+0069 &#x0069;
 * U+FB00 &#xFB00; LATIN SMALL LIGATURE FF   &rarr;    U+0066 &#x0066;     U+0066 &#x0066;
 * U+FB01 &#xFB01; LATIN SMALL LIGATURE FI   &rarr;    U+0066 &#x0066;     U+0069 &#x0069;
 * U+FB02 &#xFB02; LATIN SMALL LIGATURE FL   &rarr;    U+0066 &#x0066;     U+006C &#x006C;
 * U+FB03 &#xFB03; LATIN SMALL LIGATURE FFI   &rarr;    U+0066 &#x0066;     U+0066 &#x0066;     U+0069 &#x0069;
 * U+FB04 &#xFB04; LATIN SMALL LIGATURE FFL   &rarr;    U+0066 &#x0066;     U+0066 &#x0066;     U+006C &#x006C;
 * U+FB05 &#xFB05; LATIN SMALL LIGATURE LONG S T   &rarr;    U+017F &#x017F;     U+0074 &#x0074;
 * U+FB06 &#xFB06; LATIN SMALL LIGATURE ST   &rarr;    U+0073 &#x0073;     U+0074 &#x0074;
 * U+FB13 &#xFB13; ARMENIAN SMALL LIGATURE MEN NOW   &rarr;    U+0574 &#x0574;     U+0576 &#x0576;
 * U+FB14 &#xFB14; ARMENIAN SMALL LIGATURE MEN ECH   &rarr;    U+0574 &#x0574;     U+0565 &#x0565;
 * U+FB15 &#xFB15; ARMENIAN SMALL LIGATURE MEN INI   &rarr;    U+0574 &#x0574;     U+056B &#x056B;
 * U+FB16 &#xFB16; ARMENIAN SMALL LIGATURE VEW NOW   &rarr;    U+057E &#x057E;     U+0576 &#x0576;
 * U+FB17 &#xFB17; ARMENIAN SMALL LIGATURE MEN XEH   &rarr;    U+0574 &#x0574;     U+056D &#x056D;
 * U+FB4F &#xFB4F; HEBREW LIGATURE ALEF LAMED   &rarr;    U+05D0 &#x05D0;     U+05DC &#x05DC;

Of course, we may not want this vote to include all of the above; &lt;№&gt;, for example, is not exactly a "digraph". Conversely, we may want it to include some things that aren't listed above; the search criteria described above are just a first pass, and I welcome other thoughts. Hopefully, though, it's a starting point for discussion.

(I realize some of the above is probably gobbledegook to anyone who's not familiar with the guts of Unicode . . . if you have any questions, ask. Though I have to admit that I'm not terribly familiar with the guts of Unicode, either!)

—Ruakh TALK 00:04, 5 July 2011 (UTC)
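The search described at the top of this section can be sketched roughly as follows. This is not the original Perl script. It is a minimal Python sketch that reads the same decomposition data through the standard `unicodedata` module instead of parsing UnicodeData.txt directly. It assumes that "without any compatibility formatting tag" means the mapping carries only the generic `<compat>` tag, and that "non-modifier letter" means a general category starting with `L` other than `Lm`. The function names are hypothetical. Because Python bundles its own Unicode version, the number of matches may differ from the 56 listed above.

```python
import sys
import unicodedata


def multi_letter_compat_mapping(ch):
    """Return the mapped letter sequence if ch qualifies, else None.

    Assumed criteria: the direct decomposition is tagged <compat>
    (not <font>, <super>, <square>, ... and not an untagged canonical
    mapping) and maps to two or more letters, none of them modifier
    letters (general category Lm).
    """
    fields = unicodedata.decomposition(ch).split()
    # fields looks like ['<compat>', '0049', '004A'] for U+0132.
    if len(fields) < 3 or fields[0] != "<compat>":
        return None
    targets = [chr(int(code, 16)) for code in fields[1:]]
    for target in targets:
        category = unicodedata.category(target)
        if not category.startswith("L") or category == "Lm":
            return None
    return "".join(targets)


def affected_characters():
    """Scan every code point and collect the qualifying characters."""
    found = {}
    for code_point in range(sys.maxunicode + 1):
        ch = chr(code_point)
        mapped = multi_letter_compat_mapping(ch)
        if mapped is not None:
            found[ch] = mapped
    return found


if __name__ == "__main__":
    for ch, mapped in affected_characters().items():
        codes = " ".join(f"U+{ord(c):04X}" for c in mapped)
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> {codes}")
```

Under these assumptions, a character such as U+0149 is skipped because one of its mapped characters (U+02BC) is a modifier letter, and &lt;æ&gt; is skipped because it has no decomposition at all.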


 * Thanks, but how complete is that list? You mentioned some ligatures, but not æ... --Daniel 16:44, 5 July 2011 (UTC)


 * It is a perfectly complete list . . . of characters meeting the above-mentioned criteria. Unicode does not give a nontrivial compatibility decomposition for &lt;æ&gt;, so it didn't qualify. But as I mentioned, we may want to consider different criteria. Incidentally, Unicode names &lt;æ&gt; "LATIN SMALL LETTER AE", not "LATIN SMALL LIGATURE AE", though the latter is indicated to be an alias for it. Here is its full entry in NamesList.txt:
 00E6    LATIN SMALL LETTER AE
         = latin small ligature ae (1.0)
         = ash (from Old English æsc)
         * Danish, Norwegian, Icelandic, Faroese, Old English, French, IPA
         x (latin small ligature oe - 0153)
         x (cyrillic small ligature a ie - 04D5)
 —Ruakh TALK 18:07, 6 July 2011 (UTC)
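The point about &lt;æ&gt; can be checked against the Unicode data shipped with Python. This is only an illustrative sketch using the standard `unicodedata` module, not part of the original script; U+FB03 is included for contrast.

```python
import unicodedata

ae = "\u00e6"   # LATIN SMALL LETTER AE
ffi = "\ufb03"  # LATIN SMALL LIGATURE FFI

# The formal name uses LETTER, not LIGATURE.
print(unicodedata.name(ae))

# An empty string means there is no decomposition mapping at all.
print(repr(unicodedata.decomposition(ae)))

# Compatibility normalization therefore leaves the character unchanged.
print(unicodedata.normalize("NFKD", ae) == ae)

# By contrast, the ffi ligature has a <compat> mapping to three letters.
print(unicodedata.decomposition(ffi))
```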


 * IMO some of these, at least, are of interest in their own right and should not redirect. U+FB4F &#xFB4F; HEBREW LIGATURE ALEF LAMED, for example, can have an interesting etymology: when it was first used, why and when it's used, etc. This is independent of the page &#x05D0;&#x05DC;. The same can be said for the (now former) Rupee sign, the ffi and st families of ligatures, and perhaps more. —msh210℠ (talk) 17:22, 6 July 2011 (UTC)


 * I agree about alef-lamed and the rupee sign, but I think the ffi and st ligatures are exactly what this vote should be about. They're exactly the kind of "character" that is no longer being added to Unicode. —Ruakh TALK 18:07, 6 July 2011 (UTC)
 * I suggest restricting this vote to entries written in Latin script only, regardless of whether characters in Hebrew, Lao, Armenian and Arabic would follow suit. --Daniel 18:24, 6 July 2011 (UTC)
 * Redirecting ﬃ to ffi is a bad idea, because the latter does not exist. We can create it, but I don't see how it would be justifiable. --Daniel 18:24, 6 July 2011 (UTC)
 * Oh, right. Good point. —Ruakh TALK 20:52, 10 July 2011 (UTC)

Redirecting trigraphs
This vote should probably extend to trigraphs, to cover ℻ and ℡ as well. --Daniel 15:58, 5 July 2011 (UTC)

Redirecting Roman numerals
Apparently, should this vote pass, ⅺ will redirect to ⅹⅰ and not to xi, because we would still keep the distinction between "generic" Latin letters and Roman numerals. --Daniel 16:56, 5 July 2011 (UTC)

Specific digraphs
I restricted the list to 14 specific redirects. --Daniel 02:33, 10 July 2011 (UTC)