[Accessforall] Last call for comments: Codes for languages in ISO 24751 and the registry

Christophe Strobbe strobbe at hdm-stuttgart.de
Thu Oct 11 14:57:30 EDT 2012


Hi,

I have not seen any objections to the proposal to use IETF BCP 47 as the
standard for the term "language" in the Registry. I have collected some of
the content of the discussions (and some additional information) in the
wiki page at
<http://wiki.gpii.net/index.php/Discussion_on_Profile_Structure#Language_Codes>,
and below in this message (for the links, please read the version in the
wiki). I would like to give you some time to review this so that we can
reach consensus. If there are no objections by Monday evening (15
October) I will assume that we have reached consensus. If any
clarifications are needed, please let me know as soon as possible.

Best regards,

Christophe Strobbe


The text from the wiki page:


One of the terms in the current version of the Registry is language
(description: "a preference for the language of the user interface"). The
value space is tentatively defined as the values defined by ISO 639-2/T.
ISO 639-2/T identifies languages by means of three-letter codes (instead
of the ISO 639-1 two-letter codes that are commonly used in HTML pages)
without a means of identifying variants (see also the list of ISO 639-2
codes on Wikipedia).
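
To illustrate the difference between the two code lengths, here is a
minimal Python sketch. The table is a hand-picked, incomplete sample, and
the names ISO_639_1_TO_639_2T and to_three_letter are made up for this
example; they do not come from the Registry or from any GPII code.

```python
# Hypothetical, deliberately incomplete lookup table: ISO 639-1 (two-letter)
# codes mapped to ISO 639-2/T (three-letter) codes for a few languages.
ISO_639_1_TO_639_2T = {
    "en": "eng",
    "nl": "nld",
    "fr": "fra",
    "de": "deu",
    "zh": "zho",
}


def to_three_letter(code):
    """Return the three-letter code for a two-letter code.

    Three-letter input is returned unchanged (lower-cased); a two-letter
    code that is missing from the sample table yields None.
    """
    code = code.lower()
    if len(code) == 3:
        return code
    return ISO_639_1_TO_639_2T.get(code)
```

A real implementation would load the full code list from the registration
authority instead of hard-coding a handful of entries.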

Proposal:

Use IETF BCP 47 instead of ISO 639-2/T as the format for identifying
languages.
* BCP 47 defines a language tag as consisting of a primary language
subtag, followed by several optional subtags (especially for script,
region and/or variant).
 - Scripts can be identified by means of codes defined by ISO 15924:2004.
For example, zh-Hans and zh-Hant have sometimes been used to distinguish
between Chinese with Simplified Characters and with Traditional
Characters, respectively. The registration authority for ISO 15924 tags
is the Unicode Consortium; see Codes for the representation of names of
scripts.
 - Regions, including countries, can be identified by means of codes
defined by ISO 3166-1. An ISO 3166-1 decoding table is available on the
ISO website. The list of alpha-2 country codes (in TXT, HTML or XML) is
available free of charge for internal use and non-commercial purposes.
The full ISO 3166-1:2006, which also contains the alpha-3 codes and the
numeric codes, is not available free of charge.
* BCP 47 allows the use of three-letter codes for primary language subtags
defined by ISO 639-3. The registration authority for ISO 639-3 codes is SIL
International; see ISO 639-3 Registration Authority. Using ISO 639-3 has
several advantages:
 - This list is more complete than ISO 639-1 and ISO 639-2.
 - ISO 639-3 provides more precision for the identification of languages:
some of the ISO 639-1 codes actually refer to macrolanguages, for
example zh (Chinese) and ar (Arabic). The ISO 639-3 list distinguishes
between macrolanguages and sublanguages, for example zho (Chinese) has
sublanguages such as cmn (Mandarin), hak (Hakka) and yue (Yue or
Cantonese). These distinctions can trigger different Braille conversion
tables or text-to-speech engines (e.g. Ekho supports Cantonese, Mandarin
and Zhaoan Hakka), so these distinctions are relevant to accessibility.
See the ISO 639-3 Macrolanguage Mappings.
 - Three-letter codes also allow us to identify sign languages. ISO 639-2
contains the tag "sgn" for sign language (which would need to be refined
with subtags), and ISO 639-3 contains tags for individual sign languages,
such as ase (American Sign Language), asf (Australian Sign Language) and
sgg (Swiss-German Sign Language). ISO 639-1, by contrast, contains no
tags to identify sign languages.
* BCP 47 is also the standard for values of lang and xml:lang in HTML5.
* ISO standards can use IETF RFCs and BCPs as normative references.
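
The subtag structure described above (language, then optional script,
region and variants) can be sketched in Python as follows. This is only a
simplified illustration: the pattern ignores extended language subtags,
extensions, private-use subtags and grandfathered tags that full BCP 47
also allows, and the function name parse_language_tag is hypothetical.

```python
import re

# Simplified pattern: language[-script][-region][-variant...]
# Not a complete BCP 47 grammar; see the caveats in the text above.
_TAG = re.compile(
    r"^(?P<language>[a-z]{2,3})"
    r"(?:-(?P<script>[a-z]{4}))?"
    r"(?:-(?P<region>[a-z]{2}|[0-9]{3}))?"
    r"(?P<variants>(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*)$",
    re.IGNORECASE,
)


def parse_language_tag(tag):
    """Split a tag into its subtags, or return None if it does not match."""
    match = _TAG.match(tag)
    if match is None:
        return None
    script = match.group("script")
    region = match.group("region")
    variants = [v.lower() for v in match.group("variants").split("-") if v]
    return {
        # Conventional casing: language lower case, script title case,
        # region upper case.
        "language": match.group("language").lower(),
        "script": script.title() if script else None,
        "region": region.upper() if region else None,
        "variants": variants,
    }
```

For example, parse_language_tag("zh-Hant") yields the language "zh" with
script "Hant" and no region, while parse_language_tag("nl-be") yields the
language "nl" with region "BE".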

Note:
* While the set of languages supported by assistive technologies is only a
very small subset of the (over 5000) living languages, it is also
important to support the matching of resources in specific languages
(including subtitles, captions, etc.) with languages that a user
understands, and this is probably a much wider range than what is
supported by AT.
* Implementations would need to synchronise their list of languages with
the list maintained by SIL International (the registration authority for
ISO 639-3), since language tags may be retired (see the Retired ISO 639-3
Codes).
* Implementations would need to synchronise their list of country codes
with the list maintained by the ISO 3166 Maintenance Authority, since country
codes may be added or withdrawn (e.g. the country code for Yugoslavia was
withdrawn).
* There are a few special language codes:
 - Content in an undetermined language can be tagged with 'und' (ISO 639-2
and ISO 639-3). BCP 47 points out that this tag should only be used if a
language tag is required.
 - Content in an uncoded language can be tagged with 'mis' (ISO 639-2 and
ISO 639-3), i.e. the language is known but has no language code.
 - Non-linguistic content can be tagged with 'zxx' (ISO 639-2 and ISO
639-3), e.g. sound recordings with only nonverbal sounds, instrumental
music or programming source code.
 - Content in multiple languages can be tagged with 'mul' (ISO 639-2 and
ISO 639-3). BCP 47 points out that this tag "SHOULD NOT be used when a
list of languages or individual tags for each content element can be used
instead".
 - There is no "default country code" for languages, so if content is
tagged with only "eng" (English), there is insufficient information to
decide, for example, whether an American, Canadian, British or Australian
Braille translation table should be used.
 - The language tags described in IETF BCP 47 "are sequences of characters
from the US-ASCII [ISO646] repertoire". (This does not prohibit the use
of language tags in UTF-8 content. As Wikipedia points out: "The first
128 characters of Unicode, which correspond one-to-one with ASCII, are
encoded using a single octet with the same binary value as ASCII, making
valid ASCII text valid UTF-8-encoded Unicode as well.")
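
Some of the notes above (the special codes, the missing default country
and the US-ASCII restriction) can be sketched as a small check. This is a
rough illustration under simplifying assumptions: describe_tag and
SPECIAL_CODES are hypothetical names, and the region test is a heuristic
that does not handle private-use or extension subtags.

```python
# Special ISO 639-2 / ISO 639-3 codes listed in the notes above.
SPECIAL_CODES = {
    "und": "undetermined language",
    "mis": "uncoded language",
    "zxx": "no linguistic content",
    "mul": "multiple languages",
}


def describe_tag(tag):
    """Hypothetical helper: report a few basic facts about a language tag."""
    if not tag.isascii():
        raise ValueError("language tags must consist of US-ASCII characters")
    subtags = tag.split("-")
    primary = subtags[0].lower()
    # Heuristic: a region subtag is two letters (ISO 3166-1 alpha-2)
    # or three digits (UN M.49).
    has_region = any(
        (len(s) == 2 and s.isalpha()) or (len(s) == 3 and s.isdigit())
        for s in subtags[1:]
    )
    return {
        "primary": primary,
        "special": SPECIAL_CODES.get(primary),
        "has_region": has_region,
    }
```

A matching component could use has_region to decide whether it has enough
information to pick, for example, a country-specific Braille table, or
whether it must ask the user or fall back to some locally defined choice.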




On Fri, 5.10.2012, 17:09, Christophe Strobbe wrote:
>
> On Thu, 4.10.2012, 21:23, Gregg Vanderheiden wrote:
>> Great discussion
>>
>> We need to have someone who will own this issue and manage it through to
>> resolution.
>>
>> Christophe, can you take ownership of this  -- and work with everyone to
>> find a resolution?
>
>
> OK.
> I currently consider IETF BCP 47 <http://tools.ietf.org/html/bcp47> the
> most appropriate standard to use for the "language" term in the registry.
> In addition to what I wrote in the last two days, BCP 47 is also the
> format for the lang and xml:lang attributes in the current HTML5 draft:
> <http://www.w3.org/TR/html5/global-attributes.html#the-lang-and-xml:lang-attributes>.
> If anybody wants to speak against using IETF BCP 47 to define the value
> space for "language" in the registry, please do so by Tuesday evening next
> week (10 October).
>
> Best regards,
>
> Christophe Strobbe
>
>
>>
>>
>> Gregg
>> --------------------------------------------------------
>> Gregg Vanderheiden Ph.D.
>> Director Trace R&D Center
>> Professor Industrial & Systems Engineering
>> and Biomedical Engineering
>> University of Wisconsin-Madison
>>
>> Technical Director - Cloud4all Project - http://Cloud4all.info
>> Co-Director, Raising the Floor - International
>> and the Global Public Inclusive Infrastructure Project
>> http://Raisingthefloor.org   ---   http://GPII.net
>>
>> On Oct 4, 2012, at 6:48 AM, Christophe Strobbe
>> <strobbe at hdm-stuttgart.de>
>> wrote:
>>
>>>
>>> A few things to bear in mind before making this decision:
>>> 1. ISO 639-2 (or any other part of ISO 639) just covers the codes for
>>> the
>>> identification of languages, not subcodes for countries, scripts, etc.
>>> 2. IETF RFC 4646 describes how to combine ISO 639 language codes with
>>> ISO
>>> 3166 country codes (and other optional subtags), but prefers two-letter
>>> language codes over three-letter codes if the former type of code is
>>> available. So that would give us en-CA instead of eng-CA. So if we
>>> want
>>> to use codes like en-CA, we should refer to IETF RFC 4646; in order to
>>> use
>>> tags like eng-CA, we would need to invent our own "standard" for
>>> language
>>> codes. If we prefer IETF RFC 4646 tags, we will need to check if ISO
>>> standards can use IETF RFCs as normative references.
>>> 3. The two-letter language code is what you find in HTML pages, the
>>> OpenDocument format, and many other formats. That might be the reason
>>> why
>>> this type of code was in the sample preference sets. If we use
>>> three-letter codes, some parts of the GPII/Cloud4all architecture will
>>> need to refer to a table that maps two-letter codes to three-letter
>>> codes,
>>> because the two-letter codes seem to be the dominant convention (but
>>> that
>>> might change; e.g. Dublin Core seems to accept both types of codes).
>>>
>>>
>>> I am not speaking against using codes like eng-CA, but we should know
>>> what
>>> the impact of this decision would be.
>>>
>>>
>>> Best regards,
>>>
>>> Christophe
>>>
>>> On Thu, 4.10.2012, 07:18, Gregg Vanderheiden wrote:
>>>> OK
>>>>
>>>> 	Does anyone want to SPEAK AGAINST doing as Colin outlined, which seems
>>>> to
>>>> be in line with everyone else's comments?
>>>>
>>>> 	  If so please post any counter thoughts in the next few days.    We
>>>> have
>>>> everyone I think on the two lists attached so we can make a decision
>>>> if
>>>> there are no counter proposals to consider
>>>>
>>>> thanks
>>>>
>>>>
>>>> Gregg
>>>>
>>>>
>>>> On Oct 3, 2012, at 10:44 PM, Colin Clark <colinbdclark at gmail.com>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> We should be using ISO 639-2 language codes throughout the system. If
>>>>> not, it's a bug.
>>>>>
>>>>> If I remember correctly, this was probably introduced by the UI
>>>>> Options
>>>>> team who were integrating at very short notice with the GPII
>>>>> framework.
>>>>> I believe UI Options can support both two- and three-character
>>>>> language
>>>>> codes (as is often the case).
>>>>>
>>>>> As a speaker of "eng-CA", I don't see any reason not to simply use
>>>>> ISO
>>>>> 639-2 from the start and to also support country codes, as Christophe
>>>>> suggests. I also think it's probably worth supporting the
>>>>> two-character
>>>>> subset for interoperability if possible.
>>>>>
>>>>> Colin
>>>>>
>>>>> On 2012-10-03, at 1:18 PM, Gregg Vanderheiden wrote:
>>>>>
>>>>>> I think that having language and country codes is a great idea.
>>>>>>
>>>>>> We DO need to decide which codes to use.  I think the square
>>>>>> brackets
>>>>>> were because an official decision was not made yet
>>>>>>
>>>>>> But I think using the ISO codes for both would be the right thing to
>>>>>> do.  I added the arch list to see if someone knows  why two letter
>>>>>> codes are currently used.  (W3C?)
>>>>>>
>>>>>> We also should say something like  "if no country is specified then
>>>>>> ...."
>>>>>> (is there a default country for all languages specified somewhere?)
>>>>>> we might say the country of origin -- but I'm not sure all languages
>>>>>> have an (existing) country of origin anymore.
>>>>>>
>>>>>> Good catch Christophe.
>>>>>> Let's get a decision and then record it in the Glossary.
>>>>>>
>>>>>> I wonder if we should have a decision registry somewhere since we
>>>>>> have
>>>>>> so many people involved.
>>>>>>
>>>>>>
>>>>>> Gregg
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Oct 3, 2012, at 11:43 AM, Christophe Strobbe
>>>>>> <christophestrobbe at yahoo.co.uk> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> While creating a preference set for one of the personas in the
>>>>>>> Cloud4all smarthouse simulation
>>>>>>> <http://wiki.gpii.net/index.php/SmartHouses_Preference_Sets>, I
>>>>>>> looked
>>>>>>> into language codes and found the following:
>>>>>>> (1) ISO/IEC 24751:2008 (all subparts) refer to ISO 639-2:1998 for
>>>>>>> language codes. In the registry, the value space for "language" is
>>>>>>> [ISO 639-2/T] (I don't know the reason for the square brackets).
>>>>>>> According to
>>>>>>> <https://en.wikipedia.org/wiki/List_of_ISO_639-2_codes>
>>>>>>> and <http://www.loc.gov/standards/iso639-2/php/code_list.php>, the
>>>>>>> ISO
>>>>>>> 639-2 codes are three-letter codes (e.g. "eng" for English, "dut"
>>>>>>> or
>>>>>>> "nld" for Dutch, "fre" or "fra" for French, etc). However, the JSON
>>>>>>> preference sets I've seen so far (I mean those by the
>>>>>>> GPII/Cloud4all
>>>>>>> Architecture team) use two-letter codes (see Carla's, Nisha's and
>>>>>>> Timothy's preference sets). Am I misreading the information I found
>>>>>>> about ISO 639-2?
>>>>>>> (2) Related to this is the absence of country information, i.e.
>>>>>>> combining a language code with a country code from ISO 3166 (see
>>>>>>> <http://www.loc.gov/standards/iso639-2/faq.html#22>). This is
>>>>>>> relevant
>>>>>>> to text-to-speech engines and Braille. For example for Dutch, not
>>>>>>> many
>>>>>>> people in Flanders are keen on TTS that uses pronunciation rules
>>>>>>> from
>>>>>>> the Netherlands. Braille conventions also vary between countries
>>>>>>> that
>>>>>>> use the same official language (well, they even vary between
>>>>>>> Braille
>>>>>>> centres, but let's not go into that).
>>>>>>> (3) Note that IETF RFC 4646 <http://tools.ietf.org/html/rfc4646>
>>>>>>> gives
>>>>>>> preference to the shortest ISO 639 code (two or three letters) that
>>>>>>> is
>>>>>>> available for a language (check the ABNF syntax under
>>>>>>> <http://tools.ietf.org/html/rfc4646#section-2.1>). This base code
>>>>>>> can
>>>>>>> then be combined with an ISO 3166 country code, to create tags like
>>>>>>> en-US (American English) and en-GB (British English). However, IETF
>>>>>>> RFC 4646 is referenced neither by ISO 24751 nor by the registry.
>>>>>>>
>>>>>>> Best regards,
>>>>>>>
>>>>>>> Christophe Strobbe
>>>>>>>
>>>>>
>>>>> ---
>>>>> Colin Clark
>>>>> Technical Lead, Fluid Project
>>>>> http://fluidproject.org
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Christophe Strobbe
>>
>>
>
>
> --
> Christophe Strobbe
> Akademischer Mitarbeiter
> Adaptive User Interfaces Research Group
> Hochschule der Medien
> Nobelstraße 10
> 70569 Stuttgart
> Tel. +49 711 8923 2749
>


-- 
Christophe Strobbe
Akademischer Mitarbeiter
Adaptive User Interfaces Research Group
Hochschule der Medien
Nobelstraße 10
70569 Stuttgart
Tel. +49 711 8923 2749


