Software Internationalization

Internationalized Software

Ray Toal

Loyola Marymount University

2004-03-03

Outline

Motivation: What not to do
Some Definitions
Things to Localize
Text: Character Sets and Encodings
Java I18n HOWTO
Struts I18n
Perl I18n
XML I18n

What This Talk is About

This is a practical talk
You'll see concrete examples of internationalized code
We'll try to look at everything that needs localization...
...and show how this is done

What This Talk is NOT About

Writing Systems
Marketing Software in Multiple Countries
Upgrading Existing Non-Globalized Software
Automatic Translation
Existing Products, Language Packs, etc.

This is Wrong

Greeter.java

public class Greeter {
    public static void main(String[] args) {
        System.out.println("Good morning");
        System.out.println("Good evening");
    }
}

Internationalization (I18n)

Let's get the user-visible text out of the code:

Greeter.java

import java.util.ResourceBundle;

public class Greeter {
    public static void main(String[] args) throws Exception {
        ResourceBundle bundle = ResourceBundle.getBundle("example");
        System.out.println(bundle.getString("morning_greeting"));
        System.out.println(bundle.getString("late_greeting"));
    }
}

Localization (L10n)

Fill in the blanks for specific regions (the easiest way in Java is to use properties files):

example.properties

morning_greeting=Good morning
midday_greeting=Good afternoon
late_greeting=Good evening

example_es.properties

morning_greeting=Buenos días
midday_greeting=Buenas tardes
late_greeting=Buenas noches

example_en_AU.properties

morning_greeting=G'day
midday_greeting=G'day
late_greeting=G'day

example_ru.properties

morning_greeting=\u0414\u043e\u0431\u0440\u043e\u0435 \u0443\u0442\u0440\u043e
midday_greeting=\u0414\u043e\u0431\u0440\u044b\u0439 \u0434\u0435\u043d\u044c
late_greeting=\u0414\u043e\u0431\u0440\u044b\u0439 \u0432\u0435\u0447\u0435\u0440

Globalization (G11n)

Globalization is Internationalization + Localization

Oh yeah, running that example:

    java Greeter
    java -Duser.language=es Greeter
    java -Duser.language=en -Duser.region=AU Greeter
    java -Duser.language=ru Greeter

Locales

A locale is a geographic or political region or community that shares the same language, customs, or cultural conventions.
Three parts: LANGUAGE + COUNTRY + VARIANT

These locales are available on my laptop:

ar      ar_AE   ar_BH   ar_DZ   ar_EG   ar_IQ   ar_JO   ar_KW   ar_LB   ar_LY
ar_MA   ar_OM   ar_QA   ar_SA   ar_SD   ar_SY   ar_TN   ar_YE   hi_IN   iw
iw_IL   ja      ja_JP   ko      ko_KR   th      th_TH   th_TH_TH        zh
zh_CN   zh_HK   zh_TW   be      be_BY   bg      bg_BG   ca      ca_ES   cs
cs_CZ   da      da_DK   de      de_AT   de_CH   de_DE   de_LU   el      el_GR
en_AU   en_CA   en_GB   en_IE   en_IN   en_NZ   en_ZA   es      es_AR   es_BO
es_CL   es_CO   es_CR   es_DO   es_EC   es_ES   es_GT   es_HN   es_MX   es_NI
es_PA   es_PE   es_PR   es_PY   es_SV   es_UY   es_VE   et      et_EE   fi
fi_FI   fr      fr_BE   fr_CA   fr_CH   fr_FR   fr_LU   hr      hr_HR   hu
hu_HU   is      is_IS   it      it_CH   it_IT   lt      lt_LT   lv      lv_LV
mk      mk_MK   nl      nl_BE   nl_NL   no      no_NO   no_NO_NY        pl
pl_PL   pt      pt_BR   pt_PT   ro      ro_RO   ru      ru_RU   sh      sh_YU
sk      sk_SK   sl      sl_SI   sq      sq_AL   sr      sr_YU   sv      sv_SE
tr      tr_TR   uk      uk_UA   en      en_US
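
As a small sketch of the three-part structure described above, the following pulls a locale apart and asks the running JVM which locales it knows. The class name LocaleParts is made up for this example, and the number of available locales will differ from the list above depending on the JDK.

```java
import java.util.Locale;

class LocaleParts {
    public static void main(String[] args) {
        // LANGUAGE + COUNTRY + VARIANT, as on the slide.
        // (This constructor is deprecated in recent JDKs but still works.)
        Locale nynorsk = new Locale("no", "NO", "NY");
        System.out.println("language = " + nynorsk.getLanguage());
        System.out.println("country  = " + nynorsk.getCountry());
        System.out.println("variant  = " + nynorsk.getVariant());
        System.out.println("as text  = " + nynorsk);

        // The locales this particular JVM ships with.
        Locale[] available = Locale.getAvailableLocales();
        System.out.println(available.length + " locales available here");
    }
}
```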

What if the Requested Locale is Missing?

The system uses a search order to find the bundle.

Example: If you ask for no_NO_NY and the default is en_US_TX, the search order is

xxx_no_NO_NY.properties
xxx_no_NO.properties
xxx_no.properties
xxx_en_US_TX.properties
xxx_en_US.properties
xxx_en.properties
xxx.properties
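
The search order above can be sketched in a few lines. This is only an illustration of the fallback idea using plain strings; the class BundleSearchOrder and its methods are hypothetical and are not how ResourceBundle is actually implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BundleSearchOrder {
    // Adds base_lang_COUNTRY_VARIANT, then base_lang_COUNTRY, then base_lang.
    static void addCandidates(List<String> out, String base, String[] parts) {
        for (int n = parts.length; n >= 1; n--) {
            String suffix = String.join("_", Arrays.copyOf(parts, n));
            out.add(base + "_" + suffix + ".properties");
        }
    }

    // Requested locale first, then the default locale, then the base bundle.
    static List<String> searchOrder(String base, String[] requested, String[] fallback) {
        List<String> names = new ArrayList<>();
        addCandidates(names, base, requested);
        addCandidates(names, base, fallback);
        names.add(base + ".properties");
        return names;
    }

    public static void main(String[] args) {
        String[] requested = {"no", "NO", "NY"};
        String[] fallback = {"en", "US", "TX"};
        for (String name : searchOrder("xxx", requested, fallback)) {
            System.out.println(name);
        }
    }
}
```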

Getting Locales At Runtime

This takes the locale from the command line (a good idea)

import java.util.Locale;
import java.util.ResourceBundle;

public class AnotherGreeter {
    public static void main(String[] args) throws Exception {
        Locale locale;
        if (args.length == 0) {
            locale = Locale.getDefault();
        } else if (args.length == 1) {
            locale = new Locale(args[0]);
        } else if (args.length == 2) {
            locale = new Locale(args[0], args[1]);
        } else {
            locale = new Locale(args[0], args[1], args[2]);
        }

        ResourceBundle bundle = ResourceBundle.getBundle("example", locale);
        System.out.println("Using locale " + bundle.getLocale());
        System.out.println(bundle.getString("morning_greeting"));
        System.out.println(bundle.getString("late_greeting"));
    }
}

Resource Files

Note that the text is not written into the code
If it were, human translators (who probably are not experienced coders) would have to mess with source (yikes)
Expert programmers factor like this anyway
Programmers should appreciate keeping long text strings out of the source (where to break the lines?) :-)
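Java properties files must be encoded in ISO 8859-1, with everything else written as \uXXXX escapes (No, I don't know why); since Java 9, ResourceBundle reads properties files as UTF-8 by default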
Actually you can put more than text in a resource file (images, sounds, etc. are good candidates) but then you have to write code (see java.util.ListResourceBundle)
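
For the java.util.ListResourceBundle route mentioned above, a minimal sketch might look like the following. The class name ExampleResources and the max_greetings key are invented for illustration; only morning_greeting comes from the earlier properties files.

```java
import java.util.ListResourceBundle;

class ExampleResources extends ListResourceBundle {
    @Override
    protected Object[][] getContents() {
        // Key/value pairs; values may be any object, not just strings.
        return new Object[][] {
            {"morning_greeting", "Good morning"},
            {"max_greetings", Integer.valueOf(3)},  // hypothetical non-text resource
        };
    }

    public static void main(String[] args) {
        ExampleResources bundle = new ExampleResources();
        System.out.println(bundle.getString("morning_greeting"));
        System.out.println(bundle.getObject("max_greetings"));
    }
}
```

To load such a class by name through ResourceBundle.getBundle, it would need to be public and follow the same locale-suffix naming as the properties files.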

Languages and Countries

Languages are defined in ISO 639
Countries are defined in ISO 3166
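
Java exposes both code lists through java.util.Locale. A short sketch, assuming the es and MX codes as sample inputs (the class name IsoCodes is made up):

```java
import java.util.Arrays;
import java.util.Locale;

class IsoCodes {
    public static void main(String[] args) {
        // Two-letter ISO 639 language codes and ISO 3166 country codes.
        System.out.println(Arrays.asList(Locale.getISOLanguages()).contains("es"));
        System.out.println(Arrays.asList(Locale.getISOCountries()).contains("MX"));

        Locale mexicanSpanish = new Locale("es", "MX");
        // Human-readable names, rendered in English here.
        System.out.println(mexicanSpanish.getDisplayLanguage(Locale.ENGLISH));
        System.out.println(mexicanSpanish.getDisplayCountry(Locale.ENGLISH));
        // Three-letter forms of the same codes.
        System.out.println(mexicanSpanish.getISO3Language());
        System.out.println(mexicanSpanish.getISO3Country());
    }
}
```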

Wait — It's not JUST about Translation!

There are many linguistic and cultural issues to deal with:
- What characters are letters, numbers, symbols?
- Are there word breaks? Line breaks? How to use punctuation?
- What direction is text written?
- How, exactly, do you sort? (Uppercase/lowercase? Diacritics? Letter combinations?)
- Which calendar is being used? What's the first day of the week? Are there months? How many? What's the deal with time zones? Is Daylight Saving Time in use?
- How do we write dates? Currencies? Numbers? Percentages?
- How are colors (culturally) interpreted? (E.g. white represents mourning or death in Eastern cultures, but Western cultures use black. Red is purity in India, but danger in the U.S.)
- An image or icon acceptable to one culture might be offensive to another
- How does one "input" data from character sets with tens of thousands of characters? A big keyboard?
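
To make one of these questions concrete, here is a small sketch that asks java.util.Calendar for the first day of the week in two locales. The class and method names are invented for this example.

```java
import java.util.Calendar;
import java.util.Locale;

class FirstDayOfWeek {
    // True if weeks start on Monday by the conventions of the given locale.
    static boolean startsOnMonday(Locale locale) {
        return Calendar.getInstance(locale).getFirstDayOfWeek() == Calendar.MONDAY;
    }

    public static void main(String[] args) {
        System.out.println("US starts on Monday?     " + startsOnMonday(Locale.US));
        System.out.println("France starts on Monday? " + startsOnMonday(Locale.FRANCE));
    }
}
```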

Message Formatting

Use whole messages, not pieces of messages, to deal with differences in word order. This example is from the O'Reilly book:

In the English resource file:

message=There were {0} spelling mistakes in file {1}

In the German resource file:

message=Datei {1} enthält {0} Rechtschreibfehler

Calling:

    // ResourceBundle's lookup method is getString; arguments go in an Object[]
    String pattern = bundle.getString("message");
    String text = MessageFormat.format(pattern,
        new Object[]{Integer.valueOf(count), filename});

Okay, well scripting languages aren't so verbose....
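
A self-contained version of the same idea, with the two patterns inlined instead of loaded from a bundle so it can be run as is. The class SpellingReport, the report helper, and the file name notes.txt are assumptions for the sake of the example.

```java
import java.text.MessageFormat;

class SpellingReport {
    static final String ENGLISH = "There were {0} spelling mistakes in file {1}";
    static final String GERMAN = "Datei {1} enth\u00e4lt {0} Rechtschreibfehler";

    // The arguments are always passed in the same order;
    // the pattern decides where each one lands in the sentence.
    static String report(String pattern, int count, String filename) {
        return MessageFormat.format(pattern, count, filename);
    }

    public static void main(String[] args) {
        System.out.println(report(ENGLISH, 3, "notes.txt"));
        System.out.println(report(GERMAN, 3, "notes.txt"));
    }
}
```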

Number Formatting

Different radix separators, thousands separators, position of negative sign (if indeed there is a symbol for it), position of currency symbols, percentage symbols, etc.

Here is an example:

import java.text.NumberFormat;
import java.util.Locale;

public class Numbers {
    public static void main(String[] args) throws Exception {
        final double amount = 123456.789;
        Locale[] locales = NumberFormat.getAvailableLocales();
        System.out.println("<html><body><table>");
        for (int i = 0; i < locales.length; i++) {
            NumberFormat nf = NumberFormat.getInstance(locales[i]);
            NumberFormat cf = NumberFormat.getCurrencyInstance(locales[i]);
            NumberFormat pf = NumberFormat.getPercentInstance(locales[i]);
            System.out.println("<tr><td>" + locales[i]
                + "</td><td>" + nf.format(amount)
                + "</td><td>" + cf.format(amount)
                + "</td><td>" + pf.format(amount) + "</td></tr>");
        }
        System.out.println("</table></body></html>");
    }
}

This is the output
You can also format numbers yourself...
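
One way to format numbers yourself is a DecimalFormat pattern combined with locale-specific symbols. A minimal sketch, where the class name CustomNumbers and the pattern are chosen only for illustration:

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

class CustomNumbers {
    // Same pattern everywhere; the locale supplies the separator characters.
    static String format(double value, Locale locale) {
        DecimalFormat df = new DecimalFormat("#,##0.00",
            DecimalFormatSymbols.getInstance(locale));
        return df.format(value);
    }

    public static void main(String[] args) {
        System.out.println(format(123456.789, Locale.US));       // 123,456.79
        System.out.println(format(123456.789, Locale.GERMANY));  // 123.456,79
    }
}
```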

Date Formatting

Different orderings of day, month, and year, different separators, different month and day names, 12-hour vs. 24-hour clocks, etc.

Here is an example:

import java.text.DateFormat;
import java.util.Date;
import java.util.Locale;

public class Dates {
    public static void main(String[] args) throws Exception {
        final Date now = new Date();
        Locale[] locales = DateFormat.getAvailableLocales();
        System.out.println("<html><body><table border='1' cellspacing='0'>");
        for (int i = 0; i < locales.length; i++) {
            DateFormat shortFormat = DateFormat.getDateTimeInstance(
                DateFormat.SHORT, DateFormat.SHORT, locales[i]);
            DateFormat mediumFormat = DateFormat.getDateTimeInstance(
                DateFormat.MEDIUM, DateFormat.MEDIUM, locales[i]);
            DateFormat longFormat = DateFormat.getDateTimeInstance(
                DateFormat.LONG, DateFormat.LONG, locales[i]);
            System.out.println("<tr><td>" + locales[i]
                + "</td><td>" + shortFormat.format(now)
                + "</td><td>" + mediumFormat.format(now)
                + "</td><td>" + longFormat.format(now) + "</td></tr>");
        }
        System.out.println("</table></body></html>");
    }
}

This is the output
You can also format dates and times individually
You can also format dates yourself...
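
Formatting dates yourself usually means a SimpleDateFormat pattern. The sketch below fixes the time zone to UTC and formats the epoch so the result does not depend on when or where it runs; the class name CustomDates and the patterns are assumptions for illustration.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

class CustomDates {
    // Pattern letters pick the fields; the locale supplies month and day names.
    static String format(Date date, String pattern, Locale locale) {
        SimpleDateFormat sdf = new SimpleDateFormat(pattern, locale);
        sdf.setTimeZone(TimeZone.getTimeZone("UTC"));
        return sdf.format(date);
    }

    public static void main(String[] args) {
        Date epoch = new Date(0L);  // 1970-01-01T00:00:00Z
        System.out.println(format(epoch, "yyyy-MM-dd HH:mm", Locale.US));
        System.out.println(format(epoch, "EEEE, d MMMM yyyy", Locale.US));
        System.out.println(format(epoch, "EEEE, d MMMM yyyy", Locale.FRANCE));
    }
}
```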

String Comparison (for Searching and Sorting)

Different Languages have different sorting rules.
CH and LL are single letters in traditional Spanish.
Different languages put marked characters in different places (ö comes at the end in Icelandic, but between O and P in Turkish)
In German phone-book ordering, ö, ü, and ä act like oe, ue, and ae for sorting
Also, what about "combining marks" (accents, etc.)?
Solution is to use a collator.
(Sorry, no examples done yet.)
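
In the meantime, here is a minimal sketch of what collator-based sorting could look like with java.text.Collator. The class name CollatorSketch and the word list are made up; the point is only the contrast with plain codepoint order.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

class CollatorSketch {
    // Sorts a copy of the words using the collation rules of the given locale.
    static String[] sorted(String[] words, Locale locale) {
        String[] copy = words.clone();
        Arrays.sort(copy, Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        String[] words = {"zoo", "\u00e9t\u00e9", "eau"};  // zoo, été, eau

        // Plain String order compares UTF-16 code units, so é lands after z.
        String[] naive = words.clone();
        Arrays.sort(naive);
        System.out.println(Arrays.toString(naive));

        // A French collator treats é as a variant of e.
        System.out.println(Arrays.toString(sorted(words, Locale.FRENCH)));
    }
}
```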

Character Sets

Character = an abstract symbol, like PLUS SIGN or LATIN CAPITAL LETTER A or MUSIC FLAT SIGN
Coded Character Set, or Codeset = Repertoire plus mapping from nonnegative integers to characters.
Codepoint of a character in a codeset — the number associated with it.
Glyph - a picture of a character.
ANGSTROM SIGN and LATIN CAPITAL LETTER A WITH RING ABOVE are different characters but have the same glyph.
Examples of character sets: Unicode, UCS, ASCII, ISO 8859-x (x in 1..16).
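
A quick way to see the character/codepoint relationship from Java is Character.getName, which maps a Unicode codepoint back to its character name. A small sketch (the class name CodepointNames is invented):

```java
class CodepointNames {
    public static void main(String[] args) {
        // Two of these share a glyph but are different characters.
        int[] codepoints = {0x2B, 0x41, 0xC5, 0x212B};
        for (int cp : codepoints) {
            System.out.printf("%4X %s%n", cp, Character.getName(cp));
        }
    }
}
```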

Unicode

Here is a snippet

     25 PERCENT SIGN
     2B PLUS SIGN
     54 LATIN CAPITAL LETTER T
     5D RIGHT SQUARE BRACKET
     B0 DEGREE SIGN
     C9 LATIN CAPITAL LETTER E WITH ACUTE
    2AD LATIN LETTER BIDENTAL PERCUSSIVE
    39B GREEK CAPITAL LETTER LAMDA
    446 CYRILLIC SMALL LETTER TSE
    543 ARMENIAN CAPITAL LETTER CHEH
    5E6 HEBREW LETTER TSADI
    635 ARABIC LETTER SAD
    784 THAANA LETTER BAA
    94A DEVANAGARI VOWEL SIGN SHORT O
    9D7 BENGALI AU LENGTH MARK
    BEF TAMIL DIGIT NINE
    D93 SINHALA LETTER AIYANNA
    F0A TIBETAN MARK BKA- SHOG YIG MGO
   11C7 HANGUL JONGSEONG NIEUN-SIOS
   1293 ETHIOPIC SYLLABLE NAA
   13CB CHEROKEE LETTER QUV
   2023 TRIANGULAR BULLET
   20A4 LIRA SIGN
   2105 CARE OF
   213A ROTATED CAPITAL Q
   21B7 CLOCKWISE TOP SEMICIRCLE ARROW
   2226 NOT PARALLEL TO
   2234 THEREFORE
   265E BLACK CHESS KNIGHT
  1D111 MUSICAL SYMBOL FERMATA BELOW
  1D122 MUSICAL SYMBOL F CLEF

One still needs to note the difference between text elements and characters:
- LATIN CAPITAL LETTER A (\u0041) COMBINING TILDE (\u0303) [two characters, one text element, decomposed]
- LATIN CAPITAL LETTER A WITH TILDE (\u00c3) [precomposed]
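
The decomposed and precomposed forms compare as different strings until they are normalized. A minimal sketch using java.text.Normalizer (the class name TextElements is made up):

```java
import java.text.Normalizer;

class TextElements {
    public static void main(String[] args) {
        String decomposed = "A\u0303";   // A followed by COMBINING TILDE
        String precomposed = "\u00c3";   // A WITH TILDE as one character

        System.out.println(decomposed.length());             // 2
        System.out.println(precomposed.length());            // 1
        System.out.println(decomposed.equals(precomposed));  // false

        // NFC composes, NFD decomposes; after either, the two forms agree.
        String composed = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(composed.equals(precomposed));    // true
    }
}
```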

Character Encoding Schemes

Data is stored and transmitted in bytes (bits, octets, whatever)
Everyone knows that numbers are encoded into bytes (one's complement, two's complement, IEEE-754 single, IEEE-754 double, etc.)
How are characters encoded?
Lots of ways: Direct encoding (for small codesets), UTF-8, UTF-16, UTF-32, many others
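
To see that one character really does become different bytes under different schemes, here is a small sketch that encodes LATIN SMALL LETTER E WITH ACUTE several ways. The class name EncodedLengths is invented, and the UTF-32BE lookup assumes the JDK in use ships that charset (it is not one of the guaranteed StandardCharsets).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodedLengths {
    static int length(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9";
        System.out.println("ISO-8859-1: " + length(eAcute, StandardCharsets.ISO_8859_1));
        System.out.println("UTF-8:      " + length(eAcute, StandardCharsets.UTF_8));
        System.out.println("UTF-16BE:   " + length(eAcute, StandardCharsets.UTF_16BE));
        // Assumption: this charset name is supported by the running JDK.
        System.out.println("UTF-32BE:   " + length(eAcute, Charset.forName("UTF-32BE")));
    }
}
```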

UTF-32, UTF-16, and UTF-8

UTF-32: four bytes per character. Real simple.
UTF-16: two or four bytes per character.

Character Range	Bit Encoding
U+0000 ... U+FFFF	xxxxxxxx xxxxxxxx
U+10000 ... U+10FFFF	let y = X-10000₁₆ in 110110yy yyyyyyyy 110111yy yyyyyyyy

UTF-16 simply cannot encode codepoints beyond U+10FFFF. So far this is not a problem. Note also that the existence of UTF-16, and its blessing by the Unicode Consortium means that U+D800 through U+DFFF cannot be legal characters. Hack!?
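
The surrogate-pair rule for U+10000 ... U+10FFFF can be sketched by hand and checked against Character.toChars. The class Surrogates and its method are hypothetical, and the sketch assumes its input is already in that range.

```java
import java.util.Arrays;

class Surrogates {
    // Splits y = codepoint - 0x10000 into its high and low 10 bits.
    static char[] toSurrogates(int codepoint) {
        int y = codepoint - 0x10000;
        char high = (char) (0xD800 | (y >> 10));    // 110110yy yyyyyyyy
        char low = (char) (0xDC00 | (y & 0x3FF));   // 110111yy yyyyyyyy
        return new char[] {high, low};
    }

    public static void main(String[] args) {
        int fClef = 0x1D122;  // MUSICAL SYMBOL F CLEF
        char[] pair = toSurrogates(fClef);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]);  // D834 DD22
        System.out.println(Arrays.equals(pair, Character.toChars(fClef)));

        // One character, but two UTF-16 code units.
        String s = new String(pair);
        System.out.println(s.length() + " code units, "
            + s.codePointCount(0, s.length()) + " codepoint");
    }
}
```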

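UTF-8: one to six bytes per character in this original scheme; codepoints up to U+10FFFF need at most four, and RFC 3629 restricts UTF-8 to those.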

Character Range	Bit Encoding	(Bits)
U+0000 ... U+007F	0xxxxxxx	7
U+0080 ... U+07FF	110xxxxx 10xxxxxx	11
U+0800 ... U+FFFF	1110xxxx 10xxxxxx 10xxxxxx	16
U+10000 ... U+1FFFFF	11110xxx 10xxxxxx 10xxxxxx 10xxxxxx	21
U+200000 ... U+3FFFFFF	111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx	26
U+4000000 ... U+7FFFFFFF	1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx	31

UTF-8 rocks. The number of advantages it has is stunning. For example:

ASCII text is unchanged in UTF-8.
Non-ASCII characters are never encoded using bytes in the ASCII range.
C programmers that have assumed that the octet 00 terminates a string won't have to change their code (UTF-16 is full of these things!).
Western languages can represent text in about 1.1 bytes per character -- big savings over the other UTFs.
When decoding you can always determine if you've ended up in the middle of a multibyte encoding (no encoding starts with 10).
The number of leading ones in the first byte tells you how many bytes the character needs.
Encoding and decoding are done with shifts and logical bitmask operations, no division, so processing is very fast.
You can (lexicographically) sort by simply pretending each byte is a character.
The octets FE and FF never appear, so mass confusion about byte-order marks and "end of file markers" improperly processed by lazy C programmers never occurs.
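
The "shifts and bitmasks" claim is easy to check. Below is a sketch of a UTF-8 encoder covering the first four rows of the table, compared against the JDK's own encoder. The class Utf8Sketch is hypothetical, and it does no validation (surrogates and out-of-range values are not rejected).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Sketch {
    // Encodes one codepoint (up to U+10FFFF) using only shifts and masks.
    static byte[] encode(int cp) {
        if (cp < 0x80) {
            return new byte[] {(byte) cp};
        } else if (cp < 0x800) {
            return new byte[] {(byte) (0xC0 | (cp >> 6)),
                               (byte) (0x80 | (cp & 0x3F))};
        } else if (cp < 0x10000) {
            return new byte[] {(byte) (0xE0 | (cp >> 12)),
                               (byte) (0x80 | ((cp >> 6) & 0x3F)),
                               (byte) (0x80 | (cp & 0x3F))};
        } else {
            return new byte[] {(byte) (0xF0 | (cp >> 18)),
                               (byte) (0x80 | ((cp >> 12) & 0x3F)),
                               (byte) (0x80 | ((cp >> 6) & 0x3F)),
                               (byte) (0x80 | (cp & 0x3F))};
        }
    }

    public static void main(String[] args) {
        int[] samples = {0x54, 0x446, 0x20A4, 0x1D122};
        for (int cp : samples) {
            byte[] mine = encode(cp);
            byte[] jdk = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            System.out.printf("U+%04X: %d byte(s), matches JDK: %b%n",
                cp, mine.length, Arrays.equals(mine, jdk));
        }
    }
}
```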

J2EE Web Tier and Struts I18n

Anything presented to an end user on the web-tier should be localized.
If you're writing a servlet, it's just Java, but if you are using JSPs...
Easy to do with tags.

Struts introduced its own scheme (location of resources in struts-config.xml):

  <message-resources parameter="com.mycompany.myproject.main"/>

and a convenient bean:message tag.

  <p><bean:message key="visitor" arg0="${count}" /></p>

The JSTL has a number of tags that do this: <fmt:setLocale>, <fmt:bundle>, <fmt:message>, <fmt:setBundle>, <fmt:formatNumber>, <fmt:formatDate>, <fmt:parseDate>, <fmt:parseNumber>, <fmt:setTimeZone>, <fmt:timeZone>

Perl I18n

Perl has locales
Perl has formats
Perl 5.8 and above handle Unicode very well

Here's a little example showing Unicode in strings

use strict;
use warnings;
use charnames ':full';

my $s = "\x{0414}\x{043e}\x{0431}\x{0440}\x{043e}"
       ."\x{0435}\x{20}\x{0443}\x{0442}\x{0440}\x{043e}";
my $t = "\N{MUSIC FLAT SIGN}";

print "$s has length ", length $s, "\n";
{use bytes; print "$s has length ", bytes::length $s, "\n";}
my @a = unpack ("U*", $s);
print "@a\n";
@a = unpack ("C*", $s);
print "@a\n";

print "$t has length ", length $t, "\n";
{use bytes; print "$t has length ", bytes::length $t, "\n";}

XML I18n

XML uses Unicode natively
An XML document is NOTHING BUT UNICODE characters
You can specify a character encoding in the XML declaration

Use the xml:lang attribute

<?xml version="1.0" encoding="utf-8" ?>
<doc xml:lang="en">
 <list title="Titre en français" xml:lang="fr">
  <p>Texte en français.</p>
  <p xml:lang="fr-ca">Texte en québécois.</p>
  <p xml:lang="en">Second text in English.</p>
 </list>
 <p>Text in English.</p>
</doc>

This file comes from the XML Internationalization FAQ.

Some Internet Issues

Many protocols allow one to specify character encodings
But some protocols require ASCII
You can use UTF-7 in those cases

Stuff We Didn't Cover

More details of locales in Perl
Anything about C
Java's Input Method Framework
Fonts and rendering
Commercial Products (e.g. Content Director, System 4, WorldServer)

Summary

I18n is important
I18n often makes your code clearer
I18n is pretty easy
Lots of things should be localized