User talk:Robert Ullmann/SC recovery/prop

duplication
I don't understand how this resolves duplication if there is a clause stating: "1) Other language sections may exist, and may not be deleted". For 99% of all words there would be duplication. Even these "Serbian" and "Bosnian" month names have been used for centuries by Croatian writers (try e.g. searching for jul, juli, jula, julu etc. on Croatian WikiSource) and would easily pass CFI. The Slavic month names preferred in modern literary Croatian were dialectalisms until the late 19th century, and were moreover also used by Bosniaks and Serbs until standardization ousted them in favor of Latinisms. --Ivan Štambuk 12:09, 23 July 2009 (UTC)


 * Because we don't have to create them. Where the other languages duplicate Serbian or Croatian, we won't bother with adding those sections, unless and until someone wants to add examples, citations, etc that aren't common. And we don't do a whole lot of that.


 * So in creating a new entry, you just use one or the other of Serbian or Croatian for the section, and don't create 3 (or 4) language sections. If they do exist as minimal sections it is harmless (as long as they are correct insofar as they go), or they may have other information. So you might have an entry for a term in Croatian, complete with all the ety and -nyms etc, but also have a minimal section for (say) Bosnian with citations from Bosnian language sources. (Whatever is of interest.)


 * And yes, July is a bad example for the translations! Robert Ullmann 13:26, 23 July 2009 (UTC)


 * But that doesn't really solve the problem of duplication (which still exists, but in another form), and furthermore discriminates among varieties on the basis of script. It would be absolutely offensive for 2-3 million Ijekavian Neoštokavian speaking Serbs and 2+ million Ijekavian Neoštokavian speaking Bosniaks to impose ==Croatian== as a "primary entry" (that's the term that this proposal uses), with other sections being mere stubs. No, we don't have to create them, but sooner or later somebody will, and per this proposal that kind of action (duplication) would be irreversible. I don't want Latin-script words that are also used and written in Latin script exactly the same way by 20 million Slavs other than Croats to be formatted exclusively as ==Croatian==. I assure you that all non-Croats (and also quite a few Croats themselves) would find that discriminatory and offensive. --Ivan Štambuk 14:01, 23 July 2009 (UTC)


 * Ah, there is a distinction I was just about to explain, and then looked and realized I had written the text backward. Ouch. Sorry. The primary is supposed to be the other way around:


 * For Serbian, the primary entry is Cyrillic.
 * For Croatian, the primary entry is Latin.


 * This means that when creating entries, we would "normally" begin with those two (which must be separate as they are on different pages of course). People are welcome to add the others (it is a wiki) and may eventually do that. But in the meantime, we don't automatically have constant 3 or 4x duplication. (There is no "exclusively".) Sooner or later we need to (as Conrad has noted) have some very different structure and/or software, and that may very well arrive before someone goes and creates tens of thousands of duplicative sections by hand. (And we are unlikely to allow any bot to do that, eh?)


 * (In the wiki process, people will work on what they want to, and will generally not just spend lots of time doing something that doesn't need doing. (Understanding the varying definitions of need, of course.) What I think we will see is some people adding new words as you are, and just adding sr/Cyrl and hr/Latn, and others adding words where they want Bosnian or whatever. No language here is going to be "complete" for a long time.)


 * As I said, ouch. (I don't know how I did that, it certainly wasn't the way I was thinking about it, I was coming from the idea of primary script for a language, as reflected in discussion elsewhere.) Would you kindly look at it again the other way 'round? (-) Robert Ullmann 14:36, 23 July 2009 (UTC)
 * Nevertheless, it would allow duplication, and the duplication would be irreversible per the proposal. Why force the ==Croatian== language name upon a word that is used by millions of people other than Croats? That sounds very discriminatory to me. They must all be treated equally - either all in the same section such as ==Serbo-Croatian==, or all in separate sections as full-blown entries. --Ivan Štambuk 14:41, 23 July 2009 (UTC)


 * Why? Nothing in a wiki is ever complete. And there is no "forcing", you can add whichever language you like. But if that language is Croatian, you should be adding it to the Latin entry (at least), and if you are adding Serbian, it should be (again, at least) at the Cyrillic title. The practical result is that there will be little duplication. And anyone who thinks a given word ought have a given language section is welcome. It's a wiki. But you don't have to frustrate yourself by endlessly adding multiple sections that aren't really useful. Robert Ullmann 15:03, 23 July 2009 (UTC)
 * It's not about "frustration" (though I often appear to be frustrated when I am not!), it's about coverage and NPOV. I don't want to mark something as only ==Croatian== when it is not only Croatian. I don't want to mark something as only ==Serbian== when it is not only Serbian. What I want is to add the related family of words (in cases when they differ, such as jat reflexes) in a single and uniform way. The optimal solution is simply to treat it all in one common language section (termed ==Serbo-Croatian==, ==BCSM== or whatever). Today I simply create one entry in Latin script and convert it to a Cyrillic-script entry with one simple mouse click. So far, per your proposal, we'd have at least two different L2 sections linking to one another, with more to come once the native speakers figure out that they've somehow been omitted. The proposal does not take into account the fact that e.g. most diaspora Serbs are only conversant with Latin script, and I'm sure they'd be pretty much pissed off if they found out that the preferred name for Latin-script Serbian words was - ==Croatian== !
 * IMHO this proposal basically proscribes nothing that hasn't been done before, when separate identical entries were being created: it simply camouflages the former practice under misleading weasel words such as "primary entry", appearing to have some practical effect in reducing duplication when in fact everything that was done before is still allowed (it's just not "recommended" :), and furthermore openly promotes ethnic discrimination on the basis of script preference. Having languages named after ethnic groups is a very bad thing by itself (especially in the Balkans), and the best that we could do is to use neutral names such as Serbo-Croatian. --Ivan Štambuk 15:37, 23 July 2009 (UTC)


 * The problem there is that "Serbo-Croatian" is not neutral, it is both discriminatory and very offensive. Robert Ullmann 09:55, 24 July 2009 (UTC)
 * It is used by numerous Western (English-speaking and English-writing) Slavists that I've cited to Daniel Polansky on the vote page, and even by ISO/SIL itself (as a macrolanguage identifier). None of the native speakers that voted (2 Croats, 2 Serbs and 1 Bosniak, as far as I can count, dismissing Pepsi Lite) find it offensive. As I said several times elsewhere, I (and, I'm sure, many others) would agree to any other term, e.g. ==BCS(M)== or whatever - we can even use for SC as a special case a template ==== that would display different text upon user-defined preference - the name is not all that important: what is important is the "equal" treatment the languages get, and the problems that get solved by a unified approach. As I said, IMHO it's simply best to stick to Serbo-Croatian because it's the most widely used English-language name, as something like "BCS" would be very confusing to most of our readers: that term is not that well-known even amongst English-speaking learners of SC, whilst everyone is familiar with Serbo-Croatian. In any case, we can change the language name in L2 section names and translation tables by a bot almost trivially at any point later, if we decide to do so. What is important in the meantime is that there is no discriminatory treatment - we can treat all standard languages equally in separate sections, or all equally in the same section. Your proposal would work great if B/C/S were all written in different scripts - like we have Hindi and Urdu written in Devanagari and Arabic respectively, and Romanian and Moldovan written in Latin and Cyrillic respectively - they are also linguistically one language (the former often called Hindustani) with AFAIK the same phonology, inflection, etc., but they don't share scripts so they would never overlap. So we can link among their corresponding entries in different scripts in the headword lines easily.
With SC this is an entirely different story: Serbian is written in both scripts, and standardized on both variants (Ekavian and Ijekavian), Bosnian and Croatian only in Latin script in the Ijekavian variety. In a few years we could possibly also have Montenegrin standardized on Ijekavian and also in 2 scripts! Moreover, Ekavian can also be used for Croatian (Kajkavian dialects and Northern Chakavian dialects have the reflex of jat written as e, e.g. mleko "milk" vs. Ijekavian mlijeko), and even the non-literary Ikavian variety (with /i/ as a reflex of jat - e.g. mliko) is shared - among Bosniaks and Croats. So there's no easy way to make up for all that mess, where duplication is inherent in the way the entries are created, unless we simply treat them all in one language section. --Ivan Štambuk 11:09, 24 July 2009 (UTC)
 * Serbo-Croatian has been the normal term for a long time, and it is the most usual name for the language including Serbian, Croatian, Bosnian and Montenegrin. As such, using the word is not offensive in itself, if this meaning is what is meant. But you also know that using it and prohibiting Bosnian... headers would be felt as discriminatory and offensive by many people (especially by Bosniaks, maybe? I don't know) and that this was very probably a major reason for defining these languages as separate languages in ISO. I'm convinced that your motivation is very praiseworthy, but favouring peace and understanding between peoples is best achieved by making work on the same project easier, not by dismissing people disagreeing with you.
 * Also, it would be paradoxical to prohibit Bosnian headers for words written in Bosnian Cyrillic script (even if this script was used for Croatian). Lmaltier 11:51, 24 July 2009 (UTC)
 * I understand your concerns, Lmaltier, and once again I restate mine: in no way are we discriminating among individual languages if all of them get treated equally under the unified header. All of the native Serbo-Croatian speakers are well aware how "different" their varieties are. Bosniaks and Bosnians especially - because Bosnia and Herzegovina is a multiethnic federal state, with large-scale mixtures of population (look up the ethnic map and see for yourself), and B&H Bosniaks/Croats/Serbs, all speaking the same sub-idiom (Ijekavian Neoštokavian) with the polarization of lexis much, much lower than the standards would insinuate to the innocent outside observer, are aware that the language they speak is one, simply under different "official" names. The B&H constitution can afford to have 3 official names for languages, and SIL/ISO can afford to grant 3 different codes (they're just fulfilling the requests passed by the governments, "look we have X grammar, dictionary and orthography, and it's official by the constitution, now give us the code!") - but a practical dictionary cannot, and there is no dictionary in the world, in any language, that simultaneously treats Bosnian, Croatian and Serbian separately. Look at the Slavic languages departments at the universities in France - try finding ones that only teach "Croatian language" or "Bosnian language". It's always "in package". In this particular case we have two options for equal NPOV treatment - either all must go separately, giving precedence to none (on the basis of script, jat reflex or whatever), or all unified. Forbidding particular language section names is not forbidding the words in them, which would still appear (conveniently marked with context labels when they are variety-specific), so I can't see any kind of discrimination that you speak of.
 * "but favouring peace and understanding between peoples is best achieved by making work on a same project easier, not by dismissing people disagreeing with you" - you must be joking. Which native speakers of Serbo-Croatian (with the exception of Pepsi Lite) voted against the proposal? How many actual Serbo-Croatian contributors are voting oppose? You all seem to be concerned with problems that aren't there, and are proposing the type of "solutions" which you wouldn't be the ones having to clean up, or waste countless hours maintaining. It's easy to speak lofty words such as NPOV or tolerance when they don't touch your work here at all. As I said, if you were the ones actually contributing SC entries separately, you'd be singing a completely different tune.
 * As for the bosančica - the bulk of important works written in it are part of the Croatian literary heritage (Bosnian Franciscan monks used it almost exclusively until the late 18th century); the Bosnian Muslim elite primarily used arebica or latinica during the Ottoman era when writing Slavic (which was also very rare, as most of the fine literature produced by the Bosnian Muslim cultural elite was in Ottoman Turkish, Persian, Arabic and other "big" languages). Don't be misled by the name: Bosnian Cyrillic != Bosnian language; it was used on a much wider territory, also by other ethnicities (Croats and Serbs). --Ivan Štambuk 12:20, 24 July 2009 (UTC)
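
For readers unfamiliar with the one-click Latin-to-Cyrillic conversion mentioned earlier in this thread, the following is a minimal sketch of how such a conversion could work. It is not the tool any participant actually uses; the function name `to_cyrillic` and the table names are hypothetical. It assumes the standard one-to-one correspondence between the two alphabets, with the digraphs lj, nj and dž mapped to single Cyrillic letters, and it does not handle the rare words where those letter pairs are not digraphs (e.g. n + j in "injekcija").

```python
# Hypothetical sketch: Serbo-Croatian Latin -> Cyrillic transliteration.
# Assumption: every "lj", "nj", "dž" is a digraph (not true for a few words).

LATIN = "abcčćdđefghijklmnoprsštuvzž"
CYRILLIC = "абцчћдђефгхијклмнопрсштувзж"
SINGLE = dict(zip(LATIN, CYRILLIC))
DIGRAPHS = {"lj": "љ", "nj": "њ", "dž": "џ"}


def to_cyrillic(word: str) -> str:
    """Transliterate a Latin-script word, preserving initial capitalization."""
    out = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair.lower() in DIGRAPHS:
            cyr = DIGRAPHS[pair.lower()]
            out.append(cyr.upper() if pair[0].isupper() else cyr)
            i += 2
            continue
        ch = word[i]
        cyr = SINGLE.get(ch.lower())
        if cyr is None:
            out.append(ch)  # leave punctuation, digits, foreign letters as-is
        else:
            out.append(cyr.upper() if ch.isupper() else cyr)
        i += 1
    return "".join(out)
```

Under these assumptions, `to_cyrillic("mlijeko")` gives the Cyrillic page title for the Ijekavian form cited above; any real conversion would still need a manual check for the digraph exceptions.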

recovery of sections from history
How useful this is, or isn't, is something still to be worked on. Just adding back in duplication isn't helpful. I'm still looking at what is out there. If the bot is to pick any up, they should be tagged. Robert Ullmann 13:33, 23 July 2009 (UTC)

Bot work
I think that a bot cannot systematically create a Serbian or a Croatian section from a Serbo-Croatian section without introducing errors: it would probably work in most cases, but there may be exceptions. This task might be possible, but all words would have to be manually checked at some time, preferably beforehand.

The only possible safe bot action is to look in the history for such sections and restore them when needed. Lmaltier 13:38, 23 July 2009 (UTC)


 * I've been running tests and looking through the logs (locally), and from what I see it can tag the entries that need review: entries with notes (in whatever format) mentioning specific languages or dialects, for example. If it can do this fairly reliably, then we should be good. The detailed changes that are needed are pure bot-work; you would not want to do them by hand. Restoring the "old" Serbian or Croatian section wipes out the work that has been done. In no case do we want to replace the current section with an old one.


 * Checking them after the bot work is easier than before, as one can just edit, rather than editing and then running the bot and then checking again. Assuming we can tag correctly. Robert Ullmann 13:49, 23 July 2009 (UTC)


 * I've added a column to the report table to show whether the entry would be tagged, so we can look at cases more easily. Robert Ullmann 10:03, 24 July 2009 (UTC)
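
To make the tagging idea discussed above concrete, here is a minimal sketch of the kind of check a bot might run. It is only an illustration under stated assumptions, not the actual bot code: the function names, the keyword list, and the `{{hypothetical-review-tag}}` template are all invented for this example. The heuristic flags a section for manual review when its text mentions a specific language or dialect name, while ignoring the word "Croatian" when it is part of "Serbo-Croatian".

```python
import re

# Hypothetical keyword list; a real bot would need a reviewed, fuller list.
VARIETY_NAMES = [
    "Serbian", "Croatian", "Bosnian", "Montenegrin",
    "Ekavian", "Ijekavian", "Ikavian", "Kajkavian", "Chakavian",
]

# The lookbehind keeps "Croatian" inside "Serbo-Croatian" from matching.
VARIETY_RE = re.compile(
    r"(?<!Serbo-)\b(?:" + "|".join(VARIETY_NAMES) + r")\b",
    re.IGNORECASE,
)

REVIEW_TAG = "{{hypothetical-review-tag}}"  # placeholder, not a real template


def needs_review(section_text: str) -> bool:
    """Return True if the section mentions a specific variety by name."""
    return VARIETY_RE.search(section_text) is not None


def tag_if_needed(section_text: str) -> str:
    """Append the review tag once when the section needs manual checking."""
    if needs_review(section_text) and REVIEW_TAG not in section_text:
        return section_text.rstrip("\n") + "\n" + REVIEW_TAG + "\n"
    return section_text
```

A check like this errs toward over-tagging, which fits the workflow described above: the bot does the mechanical changes, and people review the tagged entries afterwards.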