[Accessforall] Codes for languages in ISO 24751 and the registry

Christophe Strobbe strobbe at hdm-stuttgart.de
Thu Oct 4 13:52:25 EDT 2012


Hi Andy,

Thanks for reminding me of IETF BCP 47; that's the document I should have
referred to instead of RFC 4646.


Am Do, 4.10.2012, 15:15 schrieb Andy Heath:
> Just a slight modification ..
>
> I'm led to believe the solution of choice is to use either 639-2 or
> 639-3 as appropriate. 639-3 seems to be a slight improvement on 639-2
> (unless one needs bibliographic languages) in that (as I understand it)
> where there is a group language in part 2 (such as Arabic) that has no
> specific versions, it's included in part 3 as a specific, not a group,
> language, but where there are specific versions the general language
> is omitted.

I haven't read all of BCP 47, but my impression is that using 639-3 for
primary language subtags is more precise. See for example section 4.1.2
"Using Extended Language Subtags": for macrolanguages such as Chinese and
Arabic, varieties were often identified by means of region subtags, e.g.
zh-HK (Hong Kong Chinese). With the adoption of BCP 47, there is now a
choice of language tags, e.g. cmn and yue (for Mandarin Chinese and
Cantonese, respectively; ISO 639-3); or zh-cmn (for compatibility); or
even still zh-HK (the old tag for Cantonese). yue-HK is a more precise way
of tagging Cantonese as spoken in Hong Kong (as opposed to the
neighbouring province of Canton).
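
To make the difference between these tags concrete, here is a minimal
sketch in Python that splits a tag into its subtags. It is an assumption
on my part, not code from BCP 47 or from the GPII framework: the names
TAG_PATTERN and split_tag are made up, and the pattern only covers
language, extended language, script and region subtags (no variants,
extensions, private-use or grandfathered tags).

```python
import re

# Simplified pattern (assumption): primary language, optional extended
# language subtag, optional script, optional region. This is NOT the full
# BCP 47 grammar.
TAG_PATTERN = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<extlang>[A-Za-z]{3}))?"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?$"
)


def split_tag(tag):
    """Return the subtags that are present in a simple language tag."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError("not a simple language tag: %r" % tag)
    # Drop the optional groups that did not match.
    return {name: value
            for name, value in match.groupdict().items()
            if value is not None}


for example in ("zh-HK", "zh-cmn", "cmn", "yue", "yue-HK"):
    print(example, split_tag(example))
```

With this sketch, zh-HK yields a language and a region, zh-cmn yields a
language and an extended language subtag, and yue-HK yields the 639-3
language code plus the region.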



> I am led to believe there are more languages included in
> part 3 than part 2 but I don't know how important the extra ones are.  I
> think that for all practical purposes at this point there won't be any
> differences between 639-2 and 639-3 and it's something that would be easy
> to change later but it's something to watch for.


If <http://en.wikipedia.org/wiki/List_of_ISO_639-1_codes> is a reliable
source:
* bih (Bihari) is in 639-2 but not in 639-3.
* Chinese is zho in 639-2/T, chi in 639-2/B, and zho in 639-3.
* Czech is ces in 639-2/T, cze in 639-2/B, and ces in 639-3.
* Dutch is nld in 639-2/T, dut in 639-2/B, and nld in 639-3.
* French is fra in 639-2/T, fre in 639-2/B, and fra in 639-3.
* I found similar differences for Basque, Georgian, German, Greek,
Icelandic, Macedonian, Malay, Maori, Persian (macrolanguage), Romanian,
Slovak, Tibetan and Welsh.

The 639-2/B codes are the bibliographic codes, about which BCP 47 says:
"When a language has no ISO 639-1 two-character code and the ISO 639-2/T
(Terminology) code and the ISO 639-2/B (Bibliographic) code for that
language differ, only the Terminology code is defined in the IANA
registry."

So bih is the only important difference between 639-2 and 639-3 that I've
seen so far. (In 639-2, bih is a collective code for the Bihari languages,
including Bhojpuri, Magahi and Maithili, which are represented as bho, mag
and mai, respectively, in 639-3.)
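
If we ever receive bibliographic codes, the differences listed above could
be handled with a small lookup. The following Python sketch is only an
illustration: the table contains just the pairs named in this message (not
the complete list), and the names BIBLIOGRAPHIC_TO_TERMINOLOGY,
GROUP_CODE_MEMBERS, to_terminology_code and expand_group_code are
hypothetical.

```python
# 639-2/B -> 639-2/T pairs taken from the list above (incomplete on purpose).
BIBLIOGRAPHIC_TO_TERMINOLOGY = {
    "chi": "zho",  # Chinese
    "cze": "ces",  # Czech
    "dut": "nld",  # Dutch
    "fre": "fra",  # French
}

# 639-2 group code that has no 639-3 equivalent, with the 639-3 codes
# mentioned above for the languages it covers.
GROUP_CODE_MEMBERS = {
    "bih": ["bho", "mag", "mai"],  # Bhojpuri, Magahi, Maithili
}


def to_terminology_code(code):
    """Map a 639-2/B code to its 639-2/T form; pass other codes through."""
    normalized = code.strip().lower()
    return BIBLIOGRAPHIC_TO_TERMINOLOGY.get(normalized, normalized)


def expand_group_code(code):
    """Return the 639-3 codes behind a group code, or the code itself."""
    normalized = to_terminology_code(code)
    return GROUP_CODE_MEMBERS.get(normalized, [normalized])


print(to_terminology_code("fre"))
print(expand_group_code("bih"))
```

Codes that are not in either table are returned unchanged, so the sketch
is harmless for codes that are already in their 639-2/T or 639-3 form.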



> There is also the question of using codes that aren't registered at all
> (maybe that's "yet" or maybe it's not).  There is another IETF guideline
> which provides some best practices on this and extended codes and so on
> (a good bedtime read for the geeks out there)
>
> http://tools.ietf.org/html/bcp47
>
> My point is that this is a slightly moving target that may evolve a
> little but 639-2 augmented with 639-3 if needed would do the job for now
> but possibly not for ever.


I would favour BCP 47 as a reference for language tags, because the ISO
639-x standards only cover the primary language subtags (en/eng, fr/fra,
etc), while BCP 47 covers the construction of fuller language tags using
primary language subtags, script codes (e.g. Hans, Hant, Kore), country
codes etc.
Based on what I read in BCP 47, this would favour ISO 639-3 over ISO 639-2
codes.
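
As an illustration of how such fuller tags are put together, here is a
minimal Python sketch. The function name build_tag is hypothetical, and
the sketch only joins the subtags with hyphens using the casing
conventions that BCP 47 recommends (lowercase language, title-case script,
uppercase region); it does not check the subtags against the IANA
registry.

```python
def build_tag(language, script=None, region=None):
    """Join subtags into a language tag using conventional casing."""
    parts = [language.lower()]
    if script:
        # Script subtags are conventionally written in title case.
        parts.append(script.capitalize())
    if region:
        # Region subtags are conventionally written in upper case.
        parts.append(region.upper())
    return "-".join(parts)


print(build_tag("nl", region="be"))
print(build_tag("zh", script="hant", region="hk"))
print(build_tag("yue", region="HK"))
```

For the Dutch example from earlier in this thread, this gives separate
tags for Flanders and the Netherlands from the same language code, which
is the distinction that matters for TTS and Braille.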


Best regards,

Christophe

>
> andy
>> OK
>>
>> Does anyone want to SPEAK AGAINST doing as Colin outlined which seems to
>> be in line with everyone else's comments.
>>
>>    If so please post any counter thoughts in the next few days.    We
>> have everyone I think on the two lists attached so we can make a
>> decision if there are no counter proposals to consider
>>
>> thanks
>>
>> /Gregg/
>> --------------------------------------------------------
>> Gregg Vanderheiden Ph.D.
>> Director Trace R&D Center
>> Professor Industrial & Systems Engineering
>> and Biomedical Engineering
>> University of Wisconsin-Madison
>>
>> Technical Director - Cloud4all Project - http://Cloud4all.info
>> Co-Director, Raising the Floor - International
>> and the Global Public Inclusive Infrastructure Project
>> http://Raisingthefloor.org   --- http://GPII.net
>>
>> On Oct 3, 2012, at 10:44 PM, Colin Clark <colinbdclark at gmail.com
>> <mailto:colinbdclark at gmail.com>> wrote:
>>
>>> Hi all,
>>>
>>> We should be using ISO 639-2 language codes throughout the system. If
>>> not, it's a bug.
>>>
>>> If I remember correctly, this was probably introduced by the UI
>>> Options team who were integrating at very short notice with the GPII
>>> framework. I believe UI Options can support both two- and
>>> three-character language codes (as is often the case).
>>>
>>> As a speaker of "eng-CA", I don't see any reason not to simply use ISO
>>> 639-2 from the start and to also support country codes, as Christophe
>>> suggests. I also think it's probably worth supporting the
>>> two-character subset for interoperability if possible.
>>>
>>> Colin
>>>
>>> On 2012-10-03, at 1:18 PM, Gregg Vanderheiden wrote:
>>>
>>>> I think that having language and country codes is a great idea.
>>>>
>>>> We DO need to decide which codes to use.  I think the square brackets
>>>> were because an official decision was not made yet
>>>>
>>>> But I think using the ISO codes for both would be the right thing to
>>>> do.  I added the arch list to see if someone knows  why two letter
>>>> codes are currently used.  (W3C?)
>>>>
>>>> We also should say something like  "if no country is specified then
>>>> ...."
>>>> (is there a default country for all languages specified somewhere?)
>>>> we might say the country of origin -- but I'm not sure all languages
>>>> have an (existing) country of origin anymore.
>>>>
>>>> Good catch Christophe.
>>>> Let's get a decision and then record it in the Glossary.
>>>>
>>>> I wonder if we should have a decision registry somewhere since we
>>>> have so many people involved.
>>>>
>>>>
>>>> Gregg
>>>>
>>>> On Oct 3, 2012, at 11:43 AM, Christophe Strobbe
>>>> <christophestrobbe at yahoo.co.uk
>>>> <mailto:christophestrobbe at yahoo.co.uk>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> While creating a preference set for one of the personas in the
>>>>> Cloud4all smarthouse simulation
>>>>> <http://wiki.gpii.net/index.php/SmartHouses_Preference_Sets>, I
>>>>> looked into language codes and found the following:
>>>>> (1) ISO/IEC 24751:2008 (all subparts) refer to ISO 639-2:1998 for
>>>>> language codes. In the registry, the value space for "language" is
>>>>> [ISO 639-2/T] (I don't know the reason for the square brackets).
>>>>> According to <https://en.wikipedia.org/wiki/List_of_ISO_639-2_codes>
>>>>> and <http://www.loc.gov/standards/iso639-2/php/code_list.php>, the
>>>>> ISO 639-2 codes are three-letter codes (e.g. "eng" for English,
>>>>> "dut" or "nld" for Dutch, "fre" or "fra" for French, etc). However,
>>>>> the JSON preference sets I've seen so far (I mean those by the
>>>>> GPII/Cloud4all Architecture team) use two-letter codes (see Carla's,
>>>>> Nisha's and Timothy's preference sets). Am I misreading the
>>>>> information I found about ISO 639-2?
>>>>> (2) Related to this is the absence of country information, i.e.
>>>>> combining a language code with a country code from ISO 3166 (see
>>>>> <http://www.loc.gov/standards/iso639-2/faq.html#22>). This is
>>>>> relevant to text-to-speech engines and Braille. For example for
>>>>> Dutch, not many people in Flanders are keen on TTS that uses
>>>>> pronunciation rules from the Netherlands. Braille conventions also
>>>>> vary between countries that use the same official language (well,
>>>>> they even vary between Braille centres, but let's not go into that).
>>>>> (3) Note that IETF RFC 4646 <http://tools.ietf.org/html/rfc4646>
>>>>> gives preference to the shortest ISO 639 code (two or three letters)
>>>>> that is available for a language (check the ABNF syntax under
>>>>> <http://tools.ietf.org/html/rfc4646#section-2.1>). This base code
>>>>> can then be combined with an ISO 3166 country code, to create tags
>>>>> like en-US (American English) and en-GB (British English). However,
>>>>> IETF RFC 4646 is referenced neither by ISO 24751 nor by the registry.
>>>>>
>>>>> Best regards,
>>>>>
>>>>> Christophe Strobbe
>>>>>
>>>>> _______________________________________________
>>>>> Accessforall mailing list
>>>>> Accessforall at fluidproject.org <mailto:Accessforall at fluidproject.org>
>>>>> http://lists.idrc.ocad.ca/cgi-bin/mailman/listinfo/accessforall
>>>>
>>>
>>> ---
>>> Colin Clark
>>> Technical Lead, Fluid Project
>>> http://fluidproject.org
>>>
>>
>>
>
>
>
> Cheers
>
> andy
> --
> __________________
> Andy Heath
> http://axelafa.com
>
>


-- 
Christophe Strobbe
Akademischer Mitarbeiter
Adaptive User Interfaces Research Group
Hochschule der Medien
Nobelstraße 10
70569 Stuttgart
Tel. +49 711 8923 2749


