User:Mzajac/Language attributes

This is a brainstorming page for adding standard HTML language metadata (e.g., )  to the workings of script templates.

This would require that, , , , and their variants provide a  parameter to the script template. The result would be HTML language metadata in all of their output.

Rationale
The lang and xml:lang attributes are standard HTML metadata for identifying the language of an element's content. They can be used to style text using CSS, possibly to supplement or replace the classes used in script templates. According to the HTML 4.01 specification, situations where language information may be helpful include assisting search engines, assisting speech synthesizers, helping a user agent select glyph variants for high-quality typography, helping a user agent choose a set of quotation marks, helping a user agent make decisions about hyphenation, ligatures, and spacing, and assisting spell checkers and grammar checkers. HTML 5 says that the web browser may use the element's language, e.g., in the selection of appropriate fonts or pronunciations, or for dictionary selection.

An example application would be the use of standardized CSS selectors to style any language, rather than depending on the few classes defined in Wiktionary. For example, a rule using the CSS :lang(uk) selector could set Ukrainian text in spaced small caps instead of italics.

Accessibility guidelines stress the importance of indicating language. “Clearly identify changes in the natural language of a document's text and any text equivalents (e.g., captions). [Priority 1]” (WCAG 1.0, 1999). “The human language of each passage or phrase in the content can be programmatically determined except for proper names, technical terms, words of indeterminate language, and words or phrases that have become part of the vernacular of the immediately surrounding text. (Level AA)” (WCAG 2.0, 2008).

Wiktionary should strive to provide language metadata.

Code
{{Cyrl}} is used as the example. The essential working code for most templates looks like the example below. ({{Cyrl}} actually has some #switch code which can turn the span into an i or b, and a deprecated .RU class, ignored here for clarity):

, but  or   (Russian is normally written in Cyrillic, but Serbian is in both Cyrillic and Latin, so it should have the script specified). This can be handled by a switch statement which filters the default languages:


 * Empty lang
 * {{Cyrl |слово |lang= }}


 * A default language for Cyrl
 * {{Cyrl |слово |lang=uk }}


 * An ambiguous language for Cyrl
 * {{Cyrl |слово |lang=sr }}


 * An undefined language
 * {{Cyrl |слово |lang=und }}


 * Oops
 * {{Cyrl |слово |lang=sr-Cyrl }}


 * Junk
 * {{Cyrl |слово |lang=NONSENSE! }}
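
To make the test cases above easier to reason about, here is a minimal Python sketch of a naive language filter for a Cyrl-like template. It is not the template code itself. The function name, the set of "default" Cyrillic languages, and the exact attribute output are all assumptions made for illustration.

```python
# Hypothetical sketch of a naive language filter for a Cyrl-like template.
# Assumption: these languages are "normally written in Cyrillic" and need no script subtag.
CYRL_DEFAULT_LANGS = {"be", "bg", "mk", "ru", "uk"}


def cyrl_lang_attributes(lang=""):
    """Return the attribute string a Cyrl-like script template might emit."""
    lang = lang.strip()
    if not lang:
        # Empty lang: emit only the script class, no language metadata.
        return 'class="Cyrl"'
    if lang in CYRL_DEFAULT_LANGS:
        # Default language: the bare language subtag is enough.
        tag = lang
    else:
        # Anything else: append the script subtag, with no validation at all.
        tag = f"{lang}-Cyrl"
    return f'class="Cyrl" lang="{tag}" xml:lang="{tag}"'


# Run the six test inputs from the list above through the naive filter.
for value in ["", "uk", "sr", "und", "sr-Cyrl", "NONSENSE!"]:
    print(repr(value), "->", cyrl_lang_attributes(value))
```

In this sketch, und becomes und-Cyrl, sr-Cyrl becomes sr-Cyrl-Cyrl, and NONSENSE! passes straight through with a script subtag appended, which shows why unchecked input is a problem for a filter this simple.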

What's the best way to deal with incorrect input? Should we add comprehensive error-checking? Should erroneous input be corrected, dropped silently, or flagged with an error message?
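
As a way of comparing two of these options, the following Python sketch checks a lang value against a deliberately simplified pattern and then either drops bad input or raises an error. The pattern, the function name, and the policy names are assumptions for illustration. The pattern covers only a language subtag with an optional script subtag, not the full language-tag grammar. Correcting bad input is left out because it is unclear what the correction should be.

```python
import re

# Simplified pattern (an assumption, not the full language-tag grammar):
# a 2-3 letter language subtag, optionally followed by a 4-letter script subtag.
SIMPLE_TAG = re.compile(r"[A-Za-z]{2,3}(-[A-Za-z]{4})?")


def check_lang(value, policy="drop"):
    """Return a usable lang value, handling bad input according to `policy`.

    policy="drop"  -> bad input is silently replaced by the empty string
    policy="error" -> bad input raises ValueError
    """
    value = value.strip()
    if value == "" or SIMPLE_TAG.fullmatch(value):
        return value
    if policy == "drop":
        return ""
    if policy == "error":
        raise ValueError(f"not a recognized language tag: {value!r}")
    raise ValueError(f"unknown policy: {policy!r}")


print(check_lang("sr-Cyrl"))             # accepted as is
print(repr(check_lang("NONSENSE!")))     # dropped silently
```

Dropping silently keeps the page output clean but hides mistakes from editors, while raising an error (or, in a template, showing a message or adding a cleanup category) makes them visible.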

Should the input case be adjusted (EN > en)? This is supposed to be case-insensitive, so changing it is safe, and it may help buggy implementations deal with our content.
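
Case adjustment could look something like the Python sketch below. It follows the usual writing convention for language tags (lowercase language, title-case script, uppercase region). The function name is hypothetical, and the subtag detection is a simplification based only on subtag length and position.

```python
def normalize_case(tag):
    """Adjust the case of a language tag without changing its meaning.

    Simplification (an assumption): after the first subtag, any 4-letter
    subtag is treated as a script and any 2-letter subtag as a region.
    """
    parts = tag.strip().split("-")
    normalized = [parts[0].lower()]      # language subtag: lowercase
    for sub in parts[1:]:
        if len(sub) == 4 and sub.isalpha():
            normalized.append(sub.title())   # script subtag: title case
        elif len(sub) == 2 and sub.isalpha():
            normalized.append(sub.upper())   # region subtag: uppercase
        else:
            normalized.append(sub.lower())   # everything else: lowercase
    return "-".join(normalized)


for value in ["EN", "sr-cyrl", "SR-LATN-rs"]:
    print(value, "->", normalize_case(value))
```

Because language tags are compared case-insensitively, this changes only the spelling of the tag, not what it identifies.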

If the code grows complex, should it be generalized and maintained in a single template, to be transcluded into any script template?

Complex example
Rolling and its partners into a single template (including, , , , , , , , and ). This also supports a varying class attribute:

<span dir="rtl" {{ #if: {{{lang|}}} | {{#switch: | ar = class="Arab" lang="ar" xml:lang="ar" | fa | ps | ur = class="-Arab" lang="" xml:lang="" | ks | ku | ota | pa | sd | ug = class="-Arab" lang="-Arab" xml:lang="-Arab" | az | tg | #default = class="Arab" lang="-Arab" xml:lang="-Arab" }} | class="Arab"}}></span>
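
The branching in this template can be restated as a short Python sketch. It rests on one assumption that the text above does not confirm: that each blank position in the class, lang, and xml:lang values stands for the value of the lang parameter. The function name is hypothetical.

```python
def arab_attributes(lang=""):
    """Mirror the four branches of the combined Arabic-script switch.

    Assumption: blank positions in the template stand for the lang parameter.
    """
    lang = lang.strip()
    if not lang:
        # No lang given: only the generic script class.
        return 'class="Arab"'
    if lang == "ar":
        css_class, tag = "Arab", "ar"
    elif lang in ("fa", "ps", "ur"):
        # Language-specific class, bare language tag.
        css_class, tag = f"{lang}-Arab", lang
    elif lang in ("ks", "ku", "ota", "pa", "sd", "ug"):
        # Language-specific class, tag with explicit script subtag.
        css_class, tag = f"{lang}-Arab", f"{lang}-Arab"
    else:
        # az, tg, and any other value fall through to the default branch.
        css_class, tag = "Arab", f"{lang}-Arab"
    return f'class="{css_class}" lang="{tag}" xml:lang="{tag}"'


for value in ["", "ar", "fa", "ku", "az"]:
    print(repr(value), "->", arab_attributes(value))
```

Laid out this way, the switch makes two independent decisions: whether the language gets its own class, and whether its tag needs an explicit script subtag.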

List of language subtags
Data from the IANA subtags registry (2008-11-25) has been extracted to User:Mzajac/Language attributes/IANA subtags. It needs amendments.

Language tags
HTML 4.01 specifies that language tags follow the format of RFC 1766 (1995). HTML 5 specifies its replacement, RFC 3066 (2001). The latest specification is RFC 4646 (2006), and another revision is in progress.

No language
Language codes for no language.


 * zxx – non-linguistic matter (e.g., type samples, part numbers, binary data streams)
 * und – for text of undetermined language

XHTML doesn't allow the empty string (xml:lang=""). HTML 5 says “Setting the attribute to the empty string indicates that the primary language is unknown”.

We should avoid setting empty language tags.

To do

 * Support for region or variant subtags [need a list of use cases: probably for Arabic script ( et al) and Cuneiform ]
 * Compile list of languages by script, or at least languages which conventionally use multiple scripts [started, at /IANA subtags]
 * Flag unusual language–script combinations [with a hidden category?]
 * Allow explicitly setting an empty language tag (meaning “unknown language”) [do we need this?]
 * Filter out lang="en", which needn't be included because this is the English-language Wiktionary's primary language, set in an entry's top-level HTML element [not required]
 * Generalize the code for any script template by using {{PAGENAME}} instead of “Cyrl” for both the class and lang attributes. [probably pointless]
 * Use the #language: parser function to generalize the code for any Wiktionary [probably not worth it]