[Accessforall] Codes for languages in ISO 24751 and the registry

Christophe Strobbe strobbe at hdm-stuttgart.de
Thu Oct 4 14:14:48 EDT 2012


Hi Liddy,


On Thu, 4.10.2012, 14:23, Liddy Nevile wrote:
> Can I suggest that you use the two two-character codes (one for
> language and one for country)


I'm afraid that would make it impossible to mark any content that is in a
sign language, whereas ISO 639-2 has the three-letter "sgn" code for this.
(sgn would need subcodes, but that is another can of worms.)
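
To make the point concrete, here is a rough sketch in Python. The two
patterns and the function names are my own assumptions for illustration;
they are not taken from ISO 639, RFC 4646 or the registry, and they only
check the shape of a tag, not whether the codes are actually registered.

```python
import re

# Hypothetical pattern for the proposed scheme: exactly two letters for the
# language plus exactly two letters for the country.
TWO_PLUS_TWO = re.compile(r"[a-z]{2}-[A-Z]{2}")

# Hypothetical, more permissive pattern: a two- or three-letter language
# code, optionally followed by a two-letter country code.
TWO_OR_THREE = re.compile(r"[a-z]{2,3}(-[A-Z]{2})?")


def fits_two_plus_two(tag: str) -> bool:
    """True if the whole tag has the shape 'll-CC'."""
    return TWO_PLUS_TWO.fullmatch(tag) is not None


def fits_two_or_three(tag: str) -> bool:
    """True if the whole tag has the shape 'll', 'lll', 'll-CC' or 'lll-CC'."""
    return TWO_OR_THREE.fullmatch(tag) is not None


for tag in ("en-CA", "nl-BE", "sgn"):
    print(tag, fits_two_plus_two(tag), fits_two_or_three(tag))
```

With the strict two-plus-two shape, "sgn" is rejected, while "en-CA" and
"nl-BE" pass under both patterns.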


> but almost more importantly, that you
> insist everything is Unicode... otherwise being very 'multilingual'
> won't work anyway ...


While I encourage the use of Unicode, I would like to know what
"everything" means in this context. We can enforce a certain character set
(Unicode) and a certain character encoding scheme (e.g. UTF-8) for things
under the control of GPII/Cloud4all, but not for any content that is
outside our control. For example, we can't prevent Chinese websites from
serving content in the outdated GB2312 character set.
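
For content that does enter our own components, legacy input can at least
be transcoded on the way in. A minimal sketch in Python, assuming the
source encoding is already known (the function name is hypothetical, and
detecting the encoding of arbitrary content is a separate problem that
this does not address):

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a known legacy encoding and re-encode as UTF-8."""
    text = raw.decode(source_encoding)  # legacy bytes -> Unicode string
    return text.encode("utf-8")         # Unicode string -> UTF-8 bytes


# Simulate two Chinese characters as a GB2312 site would serve them.
legacy = "你好".encode("gb2312")
utf8 = to_utf8(legacy, "gb2312")

# Same characters, different byte sequences in the two encodings.
print(len(legacy), len(utf8))
print(utf8.decode("utf-8"))
```

This also shows the difference between the character set (the characters
themselves) and the encoding scheme (the bytes used to carry them).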

Even internally, using Unicode can pose problems if you need to process it
with JavaScript, because JavaScript strings are sequences of 16-bit UTF-16
code units, so characters outside the Basic Multilingual Plane take two
code units and require workarounds (you can read about surrogate pairs
etc. at
<http://stackoverflow.com/questions/3744721/javascript-strings-outside-of-the-bmp>,
<http://blog.jochentopf.com/2011-03-17-javascript-and-unicode.html> and
<http://mathiasbynens.be/notes/javascript-encoding>).
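
The effect can be illustrated without JavaScript. The sketch below uses
Python to count UTF-16 code units, which is the unit that a JavaScript
string length is measured in; the helper name is my own and the two sample
characters are arbitrary choices.

```python
def utf16_code_units(text: str) -> int:
    """Number of 16-bit code units in the UTF-16 encoding of text."""
    return len(text.encode("utf-16-le")) // 2


bmp_char = "\u00e9"         # e with acute accent, inside the BMP
astral_char = "\U0001D11E"  # musical symbol G clef, outside the BMP

# One code point each, but the second one needs two UTF-16 code units.
print(len(bmp_char), utf16_code_units(bmp_char))
print(len(astral_char), utf16_code_units(astral_char))

# Split the two code units to see the surrogate pair itself.
units = astral_char.encode("utf-16-be")
high = int.from_bytes(units[:2], "big")
low = int.from_bytes(units[2:], "big")
print(hex(high), hex(low))
```

Code that indexes, slices or counts by code unit will therefore split such
a character in the middle unless it handles surrogate pairs explicitly.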

Best regards,

Christophe

>
> Liddy
>
> On 04/10/2012, at 9:48 PM, Christophe Strobbe wrote:
>
>>
>> A few things to bear in mind before making this decision:
>> 1. ISO 639-2 (or any other part of ISO 639) just covers the codes for
>> the identification of languages, not subcodes for countries, scripts,
>> etc.
>> 2. IETF RFC 4646 describes how to combine ISO 639 language codes with
>> ISO 3166 country codes (and other optional subtags), but prefers
>> two-letter language codes over three-letter codes if the former type
>> of code is available. So that would give us en-CA instead of eng-CA.
>> So if we want to use codes like en-CA, we should refer to IETF RFC
>> 4646; in order to use tags like eng-CA, we would need to invent our
>> own "standard" for language codes. If we prefer IETF RFC 4646 tags, we
>> will need to check if ISO standards can use IETF RFCs as normative
>> references.
>> 3. The two-letter language code is what you find in HTML pages, the
>> OpenDocument format, and many other formats. That might be the reason
>> why this type of code was in the sample preference sets. If we use
>> three-letter codes, some parts of the GPII/Cloud4all architecture will
>> need to refer to a table that maps two-letter codes to three-letter
>> codes, because the two-letter codes seem to be the dominant convention
>> (but that might change; e.g. Dublin Core seems to accept both types of
>> codes).
>>
>>
>> I am not speaking against using codes like eng-CA, but we should know
>> what the impact of this decision would be.
>>
>>
>> Best regards,
>>
>> Christophe
>>
>> On Thu, 4.10.2012, 07:18, Gregg Vanderheiden wrote:
>>> OK
>>>
>>> Does anyone want to SPEAK AGAINST doing as Colin outlined, which
>>> seems to be in line with everyone else's comments?
>>>
>>> If so, please post any counter thoughts in the next few days. We
>>> have everyone, I think, on the two lists attached, so we can make a
>>> decision if there are no counter proposals to consider.
>>>
>>> thanks
>>>
>>>
>>> Gregg
>>> --------------------------------------------------------
>>> Gregg Vanderheiden Ph.D.
>>> Director Trace R&D Center
>>> Professor Industrial & Systems Engineering
>>> and Biomedical Engineering
>>> University of Wisconsin-Madison
>>>
>>> Technical Director - Cloud4all Project - http://Cloud4all.info
>>> Co-Director, Raising the Floor - International
>>> and the Global Public Inclusive Infrastructure Project
>>> http://Raisingthefloor.org   ---   http://GPII.net
>>>
>>>
>>> On Oct 3, 2012, at 10:44 PM, Colin Clark <colinbdclark at gmail.com>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> We should be using ISO 639-2 language codes throughout the system. If
>>>> not, it's a bug.
>>>>
>>>> If I remember correctly, this was probably introduced by the UI
>>>> Options team who were integrating at very short notice with the GPII
>>>> framework. I believe UI Options can support both two- and
>>>> three-character language codes (as is often the case).
>>>>
>>>> As a speaker of "eng-CA", I don't see any reason not to simply use ISO
>>>> 639-2 from the start and to also support country codes, as Christophe
>>>> suggests. I also think it's probably worth supporting the
>>>> two-character subset for interoperability if possible.
>>>>
>>>> Colin
>>>>
>>>> On 2012-10-03, at 1:18 PM, Gregg Vanderheiden wrote:
>>>>
>>>>> I think that having language and country codes is a great idea.
>>>>>
>>>>> We DO need to decide which codes to use.  I think the square
>>>>> brackets were because an official decision was not made yet.
>>>>>
>>>>> But I think using the ISO codes for both would be the right thing to
>>>>> do.  I added the arch list to see if someone knows why two-letter
>>>>> codes are currently used.  (W3C?)
>>>>>
>>>>> We also should say something like "if no country is specified then
>>>>> ...." (is there a default country for all languages specified
>>>>> somewhere?). We might say the country of origin, but I'm not sure
>>>>> all languages have an (existing) country of origin anymore.
>>>>>
>>>>> Good catch, Christophe.
>>>>> Let's get a decision and then record it in the Glossary.
>>>>>
>>>>> I wonder if we should have a decision registry somewhere, since we
>>>>> have so many people involved.
>>>>>
>>>>>
>>>>> Gregg
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Oct 3, 2012, at 11:43 AM, Christophe Strobbe
>>>>> <christophestrobbe at yahoo.co.uk> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> While creating a preference set for one of the personas in the
>>>>>> Cloud4all smarthouse simulation
>>>>>> <http://wiki.gpii.net/index.php/SmartHouses_Preference_Sets>, I
>>>>>> looked into language codes and found the following:
>>>>>> (1) ISO/IEC 24751:2008 (all subparts) refer to ISO 639-2:1998 for
>>>>>> language codes. In the registry, the value space for "language" is
>>>>>> [ISO 639-2/T] (I don't know the reason for the square brackets).
>>>>>> According to
>>>>>> <https://en.wikipedia.org/wiki/List_of_ISO_639-2_codes> and
>>>>>> <http://www.loc.gov/standards/iso639-2/php/code_list.php>, the ISO
>>>>>> 639-2 codes are three-letter codes (e.g. "eng" for English, "dut"
>>>>>> or "nld" for Dutch, "fre" or "fra" for French, etc.). However, the
>>>>>> JSON preference sets I've seen so far (I mean those by the
>>>>>> GPII/Cloud4all Architecture team) use two-letter codes (see
>>>>>> Carla's, Nisha's and Timothy's preference sets). Am I misreading
>>>>>> the information I found about ISO 639-2?
>>>>>> (2) Related to this is the absence of country information, i.e.
>>>>>> combining a language code with a country code from ISO 3166 (see
>>>>>> <http://www.loc.gov/standards/iso639-2/faq.html#22>). This is
>>>>>> relevant to text-to-speech engines and Braille. For example, for
>>>>>> Dutch, not many people in Flanders are keen on TTS that uses
>>>>>> pronunciation rules from the Netherlands. Braille conventions also
>>>>>> vary between countries that use the same official language (well,
>>>>>> they even vary between Braille centres, but let's not go into
>>>>>> that).
>>>>>> (3) Note that IETF RFC 4646 <http://tools.ietf.org/html/rfc4646>
>>>>>> gives preference to the shortest ISO 639 code (two or three
>>>>>> letters) that is available for a language (check the ABNF syntax
>>>>>> under <http://tools.ietf.org/html/rfc4646#section-2.1>). This base
>>>>>> code can then be combined with an ISO 3166 country code, to create
>>>>>> tags like en-US (American English) and en-GB (British English).
>>>>>> However, IETF RFC 4646 is referenced neither by ISO 24751 nor by
>>>>>> the registry.
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>> Christophe Strobbe
>>>>>>
>>>>>> _______________________________________________
>>>>>> Accessforall mailing list
>>>>>> Accessforall at fluidproject.org
>>>>>> http://lists.idrc.ocad.ca/cgi-bin/mailman/listinfo/accessforall
>>>>>
>>>>
>>>> ---
>>>> Colin Clark
>>>> Technical Lead, Fluid Project
>>>> http://fluidproject.org
>>>>
>>>
>>>
>>
>>
>> --
>> Christophe Strobbe
>> Akademischer Mitarbeiter
>> Adaptive User Interfaces Research Group
>> Hochschule der Medien
>> Nobelstraße 10
>> 70569 Stuttgart
>> Tel. +49 711 8923 2749
>>
>
>


-- 
Christophe Strobbe
Akademischer Mitarbeiter
Adaptive User Interfaces Research Group
Hochschule der Medien
Nobelstraße 10
70569 Stuttgart
Tel. +49 711 8923 2749


