[Accessforall] [Architecture] Last call for comments: Codes for languages in ISO 24751 and the registry

Gregg Vanderheiden gv at trace.wisc.edu
Tue Oct 23 12:36:54 EDT 2012


There is now a decisions page -- and this is logged on it

http://wiki.gpii.net/index.php/DECISIONS_LOG


Gregg
--------------------------------------------------------
Gregg Vanderheiden Ph.D.
Director Trace R&D Center
Professor Industrial & Systems Engineering
and Biomedical Engineering University of Wisconsin-Madison
Technical Director - Cloud4all Project - http://Cloud4all.info
Co-Director, Raising the Floor - International - http://Raisingthefloor.org
and the Global Public Inclusive Infrastructure Project -  http://GPII.net

On Oct 23, 2012, at 7:56 AM, Christophe Strobbe <strobbe at hdm-stuttgart.de> wrote:

> Hi,
> 
> I was ill last week, so I'm catching up now. I have not seen any comments
> or objections to this proposal, so I have marked it as accepted by common
> consent on the wiki (with links to the calls for comments etc.).
> See
> <http://wiki.gpii.net/index.php/Discussion_on_Profile_Structure#Language_Codes>.
> 
> When we have a separate Decisions page, we can reference this decision
> from there.
> 
> Best regards,
> 
> Christophe
> 
> 
> On Thu, 11 Oct 2012, 20:57, Christophe Strobbe wrote:
>> 
>> Hi,
>> 
>> I have not seen any objections to the proposal to use IETF BCP 47 as the
>> standard for the term "language" in the Registry. I have collected some of
>> the content of the discussions (and some additional information) in the
>> wiki page at
>> <http://wiki.gpii.net/index.php/Discussion_on_Profile_Structure#Language_Codes>,
>> and below in this message (for the links, please read the version in the
>> wiki). I would like to give you some time to review this, so we can reach
>> consensus on this. If there are no objections by Monday evening (15
>> October) I will assume that we have reached consensus. If any
>> clarifications are needed, please let me know as soon as possible.
>> 
>> Best regards,
>> 
>> Christophe Strobbe
>> 
>> 
>> The text from the wiki page:
>> 
>> 
>> One of the terms in the current version of the Registry is language
>> (description: "a preference for the language of the user interface"). The
>> value space is tentatively defined as the values defined by ISO 639-2/T.
>> ISO 639-2/T identifies languages by means of three-letter codes (instead
>> of the ISO 639-1 two-letter codes that are commonly used in HTML pages)
>> without a means of identifying variants (see also the list of ISO 639-2
>> codes on Wikipedia).
>> 
>> Proposal:
>> 
>> Use IETF BCP 47 instead of ISO 639-2/T as the format for identifying
>> languages.
>> * BCP 47 defines a language tag as consisting of a primary language
>> subtag, followed by several optional subtags (especially for script,
>> region and/or variant).
>> - Scripts can be identified by means of codes defined by ISO 15924:2004.
>> For example, zh-Hans and zh-Hant have sometimes been used to distinguish
>> between Chinese with Simplified Characters and with Traditional
>> Characters, respectively. The registration authority for ISO 15924 tags
>> is the Unicode Consortium; see Codes for the representation of names of
>> scripts.
>> - Regions, including countries, can be identified by means of codes
>> defined by ISO 3166-1. An ISO 3166-1 decoding table is available on the
>> ISO website. The list of alpha-2 country codes (in TXT, HTML or XML) is
>> available free of charge for internal use and non-commercial purposes.
>> The full ISO 3166-1:2006, which also contains the alpha-3 codes and the
>> numeric codes, is not available free of charge.
>> * BCP 47 allows the use of three-letter codes for primary language tags
>> defined by ISO 639-3. The registration authority for ISO 639-3 tags is SIL
>> International; see ISO 639-3 Registration Authority. Using ISO 639-3 has
>> several advantages:
>> - This list is more complete than ISO 639-1 and ISO 639-2.
>> - ISO 639-3 provides more precision for the identification of languages:
>> some of the ISO 639-1 codes actually referred to macrolanguages, for
>> example zh (Chinese) and ar (Arabic). The ISO 639-3 list distinguishes
>> between macrolanguages and sublanguages, for example zho (Chinese) has
>> sublanguages such as cmn (Mandarin), hak (Hakka) and yue (Yue or
>> Cantonese). These distinctions can trigger different Braille conversion
>> tables or text-to-speech engines (e.g. Ekho supports Cantonese, Mandarin
>> and Zhaoan Hakka), so these distinctions are relevant to accessibility.
>> See the ISO 639-3 Macrolanguage Mappings.
>> - Three-letter codes also allow us to identify sign languages. ISO 639-2
>> contains the tag "sgn" for sign language (which would need to be refined
>> with subtags), and ISO 639-3 contains tags for individual sign languages,
>> such as ase (American Sign Language), asf (Australian Sign Language) and
>> sgg (Swiss-German Sign Language). ISO 639-1, by contrast, contained no
>> tags to identify sign languages.
>> * BCP 47 is also the standard for values of lang and xml:lang in HTML5.
>> * ISO standards can use IETF RFCs and BCPs as normative references.
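
The subtag structure described in the proposal above (primary language, optional script, optional region) can be illustrated with a minimal sketch. It assumes a deliberately simplified grammar: variants, extensions, private-use and grandfathered tags are ignored, and the function name parse_language_tag is hypothetical, not part of the registry or of any GPII code.

```python
import re

# Simplified pattern: primary language (2-3 letters), optional script
# (4 letters), optional region (2 letters or 3 digits). Variants,
# extensions, private-use and grandfathered tags are deliberately ignored.
_TAG = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?$"
)


def parse_language_tag(tag):
    """Split a simple BCP 47 tag into its subtags, or return None."""
    match = _TAG.match(tag)
    if match is None:
        return None
    parts = match.groupdict()
    # Conventional casing: language lower, script Title, region UPPER.
    parts["language"] = parts["language"].lower()
    if parts["script"]:
        parts["script"] = parts["script"].title()
    if parts["region"]:
        parts["region"] = parts["region"].upper()
    return parts
```

With this sketch, "zh-Hant" yields the language "zh" and the script "Hant" with no region, while "en-ca" yields the language "en" and the region "CA". A production system would more likely validate tags against the IANA Language Subtag Registry than rely on a pattern like this.
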
>> 
>> Note:
>> * While the set of languages supported by assistive technologies is only a
>> very small subset of the (over 5000) living languages, it is also
>> important to support the matching of resources in specific languages
>> (including subtitles, captions, etc) with languages that a user
>> understands, and this is probably a much wider range than what is
>> supported by AT.
>> * Implementations would need to synchronise their list of languages with
>> the list maintained by SIL International (the registration authority for
>> ISO 639-3), since language tags may be retired (see the Retired ISO 639-3
>> Codes).
>> * Implementations would need to synchronise their list of country codes
>> with the list maintained by the ISO 3166 Maintenance Authority, since
>> country codes may be added or withdrawn (e.g. the country code for
>> Yugoslavia was withdrawn).
>> * There are a few special language codes:
>> - Content in an undetermined language can be tagged with 'und' (ISO 639-2
>> and ISO 639-3). BCP 47 points out that this tag should only be used if a
>> language tag is required.
>> - Content in an uncoded language can be tagged with 'mis' (ISO 639-2 and
>> ISO 639-3), i.e. the language is known but has no language code.
>> - Non-linguistic content can be tagged with 'zxx' (ISO 639-2 and ISO
>> 639-3), e.g. sound recordings with only nonverbal sounds, instrumental
>> music or programming source code.
>> - Content in multiple languages can be tagged with 'mul' (ISO 639-2 and
>> ISO 639-3). BCP 47 points out that this tag "SHOULD NOT be used when a
>> list of languages or individual tags for each content element can be used
>> instead".
>> - There is no "default country code" for languages, so if content is
>> tagged with only "eng" (English), there is insufficient information to
>> decide, for example, whether an American, Canadian, British or Australian
>> Braille translation table should be used.
>> - The language tags described in IETF BCP 47 "are sequences of characters
>> from the US-ASCII [ISO646] repertoire". (This does not prohibit the use
>> of language tags in UTF-8 content. As Wikipedia points out: "The first
>> 128 characters of Unicode, which correspond one-to-one with ASCII, are
>> encoded using a single octet with the same binary value as ASCII, making
>> valid ASCII text valid UTF-8-encoded Unicode as well.")
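
As a rough illustration of the notes above, the following sketch classifies the special codes and reports whether a tag carries a region. The dictionary only restates the four codes listed in the note; the function name describe_tag and the rule "a two-letter subtag after the first is a region" are simplifying assumptions for the example.

```python
# Special-purpose primary language subtags named in the note above.
SPECIAL_CODES = {
    "und": "undetermined language",
    "mis": "uncoded language",
    "zxx": "no linguistic content",
    "mul": "multiple languages",
}


def describe_tag(tag):
    """Return (kind, has_region) for a simple hyphen-separated tag."""
    subtags = tag.split("-")
    primary = subtags[0].lower()
    kind = SPECIAL_CODES.get(primary, "ordinary language")
    # Assumption: a two-letter subtag after the first is an ISO 3166-1 region.
    has_region = any(len(s) == 2 and s.isalpha() for s in subtags[1:])
    return kind, has_region
```

A tag such as "eng" is reported as an ordinary language without a region, which is exactly the case where a matcher cannot decide between, say, an American and a British Braille table without further information.
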
>> 
>> 
>> 
>> 
>> On Fri, 5 Oct 2012, 17:09, Christophe Strobbe wrote:
>>> 
>>> On Thu, 4 Oct 2012, 21:23, Gregg Vanderheiden wrote:
>>>> Great discussion
>>>> 
>>>> We need to have someone who will own this issue and manage it through to
>>>> resolution.
>>>> 
>>>> Christophe, can you take ownership of this -- and work with everyone to
>>>> find a resolution?
>>> 
>>> 
>>> OK.
>>> I currently consider IETF BCP 47 <http://tools.ietf.org/html/bcp47> the
>>> most appropriate standard to use for the "language" term in the registry.
>>> In addition to what I wrote in the last two days, BCP 47 is also the
>>> format for the lang and xml:lang attributes in the current HTML5 draft:
>>> <http://www.w3.org/TR/html5/global-attributes.html#the-lang-and-xml:lang-attributes>.
>>> If anybody wants to speak against using IETF BCP 47 to define the value
>>> space for "language" in the registry, please do so by Tuesday evening next
>>> week (10 October).
>>> 
>>> Best regards,
>>> 
>>> Christophe Strobbe
>>> 
>>> 
>>>> 
>>>> 
>>>> Gregg
>>>> 
>>>> On Oct 4, 2012, at 6:48 AM, Christophe Strobbe <strobbe at hdm-stuttgart.de> wrote:
>>>> 
>>>>> 
>>>>> A few things to bear in mind before making this decision:
>>>>> 1. ISO 639-2 (or any other part of ISO 639) just covers the codes for the
>>>>> identification of languages, not subcodes for countries, scripts, etc.
>>>>> 2. IETF RFC 4646 describes how to combine ISO 639 language codes with ISO
>>>>> 3166 country codes (and other optional subtags), but prefers two-letter
>>>>> language codes over three-letter codes if the former type of code is
>>>>> available. So that would give us en-CA instead of eng-CA. So if we want
>>>>> to use codes like en-CA, we should refer to IETF RFC 4646; in order to use
>>>>> tags like eng-CA, we would need to invent our own "standard" for language
>>>>> codes. If we prefer IETF RFC 4646 tags, we will need to check if ISO
>>>>> standards can use IETF RFCs as normative references.
>>>>> 3. The two-letter language code is what you find in HTML pages, the
>>>>> OpenDocument format, and many other formats. That might be the reason why
>>>>> this type of code was in the sample preference sets. If we use
>>>>> three-letter codes, some parts of the GPII/Cloud4all architecture will
>>>>> need to refer to a table that maps two-letter codes to three-letter codes,
>>>>> because the two-letter codes seem to be the dominant convention (but that
>>>>> might change; e.g. Dublin Core seems to accept both types of codes).
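
The mapping table mentioned in point 3 could look roughly like the sketch below. The table is a small hand-picked excerpt of ISO 639-1 to ISO 639-2/T pairs chosen for illustration; the names TWO_TO_THREE, to_shortest and to_three_letter are hypothetical, and a real component would load the complete list from the registration authority rather than hard-code it.

```python
# Hypothetical excerpt of an ISO 639-1 -> ISO 639-2/T mapping table;
# a real deployment would load the full list instead of hard-coding it.
TWO_TO_THREE = {
    "en": "eng",
    "fr": "fra",
    "nl": "nld",
    "de": "deu",
    "zh": "zho",
    "ar": "ara",
}
THREE_TO_TWO = {three: two for two, three in TWO_TO_THREE.items()}


def to_shortest(tag):
    """Rewrite the primary subtag to the two-letter code when one exists."""
    primary, sep, rest = tag.partition("-")
    primary = primary.lower()
    return THREE_TO_TWO.get(primary, primary) + sep + rest


def to_three_letter(tag):
    """Rewrite the primary subtag to the three-letter code when known."""
    primary, sep, rest = tag.partition("-")
    primary = primary.lower()
    return TWO_TO_THREE.get(primary, primary) + sep + rest
```

Here to_shortest("eng-CA") gives "en-CA", the form RFC 4646 prefers, and to_three_letter("en-CA") gives "eng-CA". Codes without a two-letter equivalent in the table, such as "yue", pass through unchanged.
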
>>>>> 
>>>>> 
>>>>> I am not speaking against using codes like eng-CA, but we should know what
>>>>> the impact of this decision would be.
>>>>> 
>>>>> 
>>>>> Best regards,
>>>>> 
>>>>> Christophe
>>>>> 
>>>>> On Thu, 4 Oct 2012, 07:18, Gregg Vanderheiden wrote:
>>>>>> OK
>>>>>> 
>>>>>> Does anyone want to SPEAK AGAINST doing as Colin outlined, which seems to
>>>>>> be in line with everyone else's comments?
>>>>>> 
>>>>>> If so, please post any counter thoughts in the next few days. We have
>>>>>> everyone, I think, on the two lists attached, so we can make a decision if
>>>>>> there are no counter proposals to consider.
>>>>>> 
>>>>>> thanks
>>>>>> 
>>>>>> 
>>>>>> Gregg
>>>>>> 
>>>>>> 
>>>>>> On Oct 3, 2012, at 10:44 PM, Colin Clark <colinbdclark at gmail.com> wrote:
>>>>>> 
>>>>>>> Hi all,
>>>>>>> 
>>>>>>> We should be using ISO 639-2 language codes throughout the system. If
>>>>>>> not, it's a bug.
>>>>>>> 
>>>>>>> If I remember correctly, this was probably introduced by the UI Options
>>>>>>> team who were integrating at very short notice with the GPII framework.
>>>>>>> I believe UI Options can support both two- and three-character language
>>>>>>> codes (as is often the case).
>>>>>>> 
>>>>>>> As a speaker of "eng-CA", I don't see any reason not to simply use ISO
>>>>>>> 639-2 from the start and to also support country codes, as Christophe
>>>>>>> suggests. I also think it's probably worth supporting the two-character
>>>>>>> subset for interoperability if possible.
>>>>>>> 
>>>>>>> Colin
>>>>>>> 
>>>>>>> On 2012-10-03, at 1:18 PM, Gregg Vanderheiden wrote:
>>>>>>> 
>>>>>>>> I think that having language and country codes is a great idea.
>>>>>>>> 
>>>>>>>> We DO need to decide which codes to use.  I think the square brackets
>>>>>>>> were because an official decision was not made yet.
>>>>>>>> 
>>>>>>>> But I think using the ISO codes for both would be the right thing to
>>>>>>>> do.  I added the arch list to see if someone knows why two-letter
>>>>>>>> codes are currently used.  (W3C?)
>>>>>>>> 
>>>>>>>> We also should say something like "if no country is specified then ...."
>>>>>>>> (is there a default country for all languages specified somewhere?)
>>>>>>>> We might say the country of origin -- but I'm not sure all languages
>>>>>>>> have an (existing) country of origin anymore.
>>>>>>>> 
>>>>>>>> Good catch Christophe.
>>>>>>>> Let's get a decision and then record it in the Glossary.
>>>>>>>> 
>>>>>>>> I wonder if we should have a decision registry somewhere since we have
>>>>>>>> so many people involved.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Gregg
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Oct 3, 2012, at 11:43 AM, Christophe Strobbe <christophestrobbe at yahoo.co.uk> wrote:
>>>>>>>> 
>>>>>>>>> Hi,
>>>>>>>>> 
>>>>>>>>> While creating a preference set for one of the personas in the
>>>>>>>>> Cloud4all smarthouse simulation
>>>>>>>>> <http://wiki.gpii.net/index.php/SmartHouses_Preference_Sets>, I looked
>>>>>>>>> into language codes and found the following:
>>>>>>>>> (1) ISO/IEC 24751:2008 (all subparts) refers to ISO 639-2:1998 for
>>>>>>>>> language codes. In the registry, the value space for "language" is
>>>>>>>>> [ISO 639-2/T] (I don't know the reason for the square brackets).
>>>>>>>>> According to
>>>>>>>>> <https://en.wikipedia.org/wiki/List_of_ISO_639-2_codes>
>>>>>>>>> and <http://www.loc.gov/standards/iso639-2/php/code_list.php>, the ISO
>>>>>>>>> 639-2 codes are three-letter codes (e.g. "eng" for English, "dut" or
>>>>>>>>> "nld" for Dutch, "fre" or "fra" for French, etc.). However, the JSON
>>>>>>>>> preference sets I've seen so far (I mean those by the GPII/Cloud4all
>>>>>>>>> Architecture team) use two-letter codes (see Carla's, Nisha's and
>>>>>>>>> Timothy's preference sets). Am I misreading the information I found
>>>>>>>>> about ISO 639-2?
>>>>>>>>> (2) Related to this is the absence of country information, i.e.
>>>>>>>>> combining a language code with a country code from ISO 3166 (see
>>>>>>>>> <http://www.loc.gov/standards/iso639-2/faq.html#22>). This is relevant
>>>>>>>>> to text-to-speech engines and Braille. For example for Dutch, not many
>>>>>>>>> people in Flanders are keen on TTS that uses pronunciation rules from
>>>>>>>>> the Netherlands. Braille conventions also vary between countries that
>>>>>>>>> use the same official language (well, they even vary between Braille
>>>>>>>>> centres, but let's not go into that).
>>>>>>>>> (3) Note that IETF RFC 4646 <http://tools.ietf.org/html/rfc4646> gives
>>>>>>>>> preference to the shortest ISO 639 code (two or three letters) that is
>>>>>>>>> available for a language (check the ABNF syntax under
>>>>>>>>> <http://tools.ietf.org/html/rfc4646#section-2.1>). This base code can
>>>>>>>>> then be combined with an ISO 3166 country code, to create tags like
>>>>>>>>> en-US (American English) and en-GB (British English). However, IETF
>>>>>>>>> RFC 4646 is referenced neither by ISO 24751 nor by the registry.
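
To show why the country subtag matters for text-to-speech and Braille selection, here is a minimal sketch of matching a preferred tag against available resources. It tries the full tag first and then drops trailing subtags one at a time, a simplified form of the lookup scheme in RFC 4647. The function name lookup and the voices list are hypothetical and do not describe any actual GPII component.

```python
def lookup(preferred_tag, available):
    """Return the best available tag for preferred_tag, or None.

    Tries the full tag first, then drops trailing subtags one at a time
    (a simplified form of the lookup scheme in RFC 4647).
    """
    by_lower = {tag.lower(): tag for tag in available}
    subtags = preferred_tag.lower().split("-")
    while subtags:
        candidate = "-".join(subtags)
        if candidate in by_lower:
            return by_lower[candidate]
        subtags.pop()
    return None


# Hypothetical list of installed text-to-speech voices.
voices = ["nl-NL", "nl-BE", "en-US", "en"]
```

A Flemish user asking for "nl-BE" gets the Belgian Dutch voice rather than the Netherlands one. A request for plain "nl" finds nothing in this list, because truncation only removes subtags and never guesses a country.
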
>>>>>>>>> 
>>>>>>>>> Best regards,
>>>>>>>>> 
>>>>>>>>> Christophe Strobbe
>>>>>>>>> 
>>>>>>> 
>>>>>>> ---
>>>>>>> Colin Clark
>>>>>>> Technical Lead, Fluid Project
>>>>>>> http://fluidproject.org
>>>>> 
>>>>> --
>>>>> Christophe Strobbe
>>> 
>>> 
>>> --
>>> Christophe Strobbe
>>> 
>> 
>> 
>> --
>> Christophe Strobbe
> 
> 
> -- 
> Christophe Strobbe
> Akademischer Mitarbeiter
> Adaptive User Interfaces Research Group
> Hochschule der Medien
> Nobelstraße 10
> 70569 Stuttgart
> Tel. +49 711 8923 2749
> 
