Thread:User talk:CodeCat/Transliteration questions/reply

I can't help you at all on the first part, sorry.

For the Italic alphabets, the common set was chosen so that it could apply to all languages. If a mapping doesn't apply to all languages equally, then it shouldn't be in the common set. Alternatively, you could transliterate the language-specific features first, and let the common set handle whatever remains after that.

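One thing you need to be careful with is using gsub with '.' to replace multiple-character combinations. That's not going to work, because '.' matches exactly one character at a time, so the replacement never sees a combination as a unit. Sadly, extending it to '..' will not work either, in case you were thinking of that: it cuts the string into fixed pairs, so single characters are no longer looked up and a combination that straddles two pairs is missed. The way I handle these situations is a bit more elaborate, but it works much better: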


 * "rest" contains the characters yet to be processed; "parts" is a table containing the characters or sequences that have been recognised so far.
 * Compare the start of the "rest" string against each of the character search sequences and find the longest one that matches.
 * Once the longest match is determined, insert it into the list of parts. If no match was found at all, just insert the first character.
 * Remove the processed characters from "rest".
 * Repeat until "rest" is empty.
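
The steps above can be sketched roughly as follows. This is only an illustration in Python rather than the module's actual Lua code; the function names and the little MAPPING table are made up for the example, and `re.sub` with '.' stands in for the per-character gsub call.

```python
import re


def split_longest_match(text, sequences):
    """Split text into parts, taking the longest known sequence at each position."""
    # Try longer sequences first, so that "ch" is preferred over "c".
    candidates = sorted((s for s in sequences if s), key=len, reverse=True)
    rest = text    # characters yet to be processed
    parts = []     # characters or sequences recognised so far
    while rest:
        match = next((s for s in candidates if rest.startswith(s)), None)
        if match is None:
            # No sequence matched at all: just take the first character.
            match = rest[0]
        parts.append(match)
        # Remove the processed characters from "rest".
        rest = rest[len(match):]
    return parts


def transliterate(text, mapping):
    """Replace each recognised part using mapping; unknown parts pass through."""
    parts = split_longest_match(text, mapping.keys())
    return "".join(mapping.get(part, part) for part in parts)


# Hypothetical mapping, for illustration only.
MAPPING = {"c": "k", "ch": "x"}

# Per-character replacement never sees "ch" as a unit.
naive = re.sub(r".", lambda m: MAPPING.get(m.group(0), m.group(0)), "chac")
print(naive)                           # khak
print(transliterate("chac", MAPPING))  # xak
```

Sorting the sequences by length once up front is just one way to get the longest match; checking every sequence and keeping the longest hit would do the same thing.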