Internationalized Software
Ray Toal
Loyola Marymount University
2004-03-03
Outline
What This Talk is About
What This Talk is NOT About
This is Wrong
public class Greeting {
    public static void main(String[] args) {
        System.out.println("Good morning");
        System.out.println("Good evening");
    }
}
Internationalization (I18n)
Let's get the user-visible text out of the code:
import java.util.ResourceBundle;

public class Greeter {
    public static void main(String[] args) throws Exception {
        ResourceBundle bundle = ResourceBundle.getBundle("example");
        System.out.println(bundle.getString("morning_greeting"));
        System.out.println(bundle.getString("late_greeting"));
    }
}
Localization (L10n)
Fill in the blanks for specific regions (the easiest way in Java is to use properties files):
morning_greeting=Good morning
midday_greeting=Good afternoon
late_greeting=Good evening

morning_greeting=Buenos días
midday_greeting=Buenas tardes
late_greeting=Buenas noches

morning_greeting=G'day
midday_greeting=G'day
late_greeting=G'day

morning_greeting=\u0414\u043e\u0431\u0440\u043e\u0435 \u0443\u0442\u0440\u043e
midday_greeting=\u0414\u043e\u0431\u0440\u044b\u0439 \u0434\u0435\u043d\u044c
late_greeting=\u0414\u043e\u0431\u0440\u044b\u0439 \u0432\u0435\u0447\u0435\u0440
Globalization (G11n)
Globalization is Internationalization + Localization
java Greeter
java -Duser.language=es Greeter
java -Duser.language=en -Duser.region=AU Greeter
java -Duser.language=ru Greeter
Locales
ar ar_AE ar_BH ar_DZ ar_EG ar_IQ ar_JO ar_KW ar_LB ar_LY ar_MA ar_OM ar_QA ar_SA ar_SD ar_SY ar_TN ar_YE hi_IN iw iw_IL ja ja_JP ko ko_KR th th_TH th_TH_TH zh zh_CN zh_HK zh_TW be be_BY bg bg_BG ca ca_ES cs cs_CZ da da_DK de de_AT de_CH de_DE de_LU el el_GR en_AU en_CA en_GB en_IE en_IN en_NZ en_ZA es es_AR es_BO es_CL es_CO es_CR es_DO es_EC es_ES es_GT es_HN es_MX es_NI es_PA es_PE es_PR es_PY es_SV es_UY es_VE et et_EE fi fi_FI fr fr_BE fr_CA fr_CH fr_FR fr_LU hr hr_HR hu hu_HU is is_IS it it_CH it_IT lt lt_LT lv lv_LV mk mk_MK nl nl_BE nl_NL no no_NO no_NO_NY pl pl_PL pt pt_BR pt_PT ro ro_RO ru ru_RU sh sh_YU sk sk_SK sl sl_SI sq sq_AL sr sr_YU sv sv_SE tr tr_TR uk uk_UA en en_US
What if the Requested Locale is Missing?
xxx_no_NO_NY.properties
xxx_no_NO.properties
xxx_no.properties
xxx_en_US_TX.properties
xxx_en_US.properties
xxx_en.properties
xxx.properties
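The lookup order listed above can be sketched in a few lines. This is only an illustration of the fallback idea, not the JDK's actual lookup code: the class `BundleFallback` and its methods are hypothetical, and locales are passed as plain strings such as `no_NO_NY` (requested) and `en_US_TX` (assumed to be the default locale).

```java
import java.util.ArrayList;
import java.util.List;

public class BundleFallback {
    // Builds the list of file names to try, most specific first:
    // the requested locale chain, then the default locale chain,
    // then the base bundle itself.
    public static List<String> candidates(String baseName, String requested, String fallback) {
        List<String> names = new ArrayList<>();
        addChain(names, baseName, requested);
        addChain(names, baseName, fallback);
        names.add(baseName + ".properties");
        return names;
    }

    // Adds base_lang_REGION_VARIANT, base_lang_REGION, base_lang by
    // repeatedly dropping the last underscore-separated component.
    private static void addChain(List<String> names, String baseName, String localeTag) {
        String current = localeTag;
        while (!current.isEmpty()) {
            String name = baseName + "_" + current + ".properties";
            if (!names.contains(name)) {
                names.add(name);
            }
            int cut = current.lastIndexOf('_');
            current = cut < 0 ? "" : current.substring(0, cut);
        }
    }

    public static void main(String[] args) {
        for (String name : candidates("xxx", "no_NO_NY", "en_US_TX")) {
            System.out.println(name);
        }
    }
}
```

The first file in the list that exists wins; keys missing from it are looked up in the less specific files of the same chain.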
Getting Locales At Runtime
import java.util.Locale;
import java.util.ResourceBundle;

public class AnotherGreeter {
    public static void main(String[] args) throws Exception {
        Locale locale;
        if (args.length == 0) {
            locale = Locale.getDefault();
        } else if (args.length == 1) {
            locale = new Locale(args[0]);
        } else if (args.length == 2) {
            locale = new Locale(args[0], args[1]);
        } else {
            locale = new Locale(args[0], args[1], args[2]);
        }
        ResourceBundle bundle = ResourceBundle.getBundle("example", locale);
        System.out.println("Using locale " + bundle.getLocale());
        System.out.println(bundle.getString("morning_greeting"));
        System.out.println(bundle.getString("late_greeting"));
    }
}
Resource Files
java.util.ListResourceBundle
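A resource bundle does not have to be a properties file; it can also be a class. The sketch below is a minimal `ListResourceBundle` subclass holding the same greeting keys used earlier. The class name `ExampleBundle` is made up for illustration.

```java
import java.util.ListResourceBundle;
import java.util.ResourceBundle;

public class ExampleBundle extends ListResourceBundle {
    // Each row is a {key, value} pair. Values may be any Object,
    // not only strings, which is the main advantage over properties files.
    @Override
    protected Object[][] getContents() {
        return new Object[][] {
            {"morning_greeting", "Good morning"},
            {"midday_greeting", "Good afternoon"},
            {"late_greeting", "Good evening"},
        };
    }

    public static void main(String[] args) {
        // Instantiated directly here to keep the sketch self-contained;
        // normally ResourceBundle.getBundle would locate the class by name.
        ResourceBundle bundle = new ExampleBundle();
        System.out.println(bundle.getString("morning_greeting"));
        System.out.println(bundle.getString("late_greeting"));
    }
}
```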
Wait: It's not JUST about Translation!
Message Formatting
message=There were {0} spelling mistakes in file {1}
message=Datei {1} enthält {0} Rechtschreibfehler
String pattern = bundle.getString("message");
String text = MessageFormat.format(pattern, new Object[]{Integer.valueOf(count), filename});
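To see why numbered placeholders matter, the following sketch formats the English and German patterns from above with the same arguments. The patterns are hard-coded here instead of being read from a bundle, and the class `MessageDemo`, its `render` method, and the file name `notes.txt` are made up for illustration.

```java
import java.text.MessageFormat;
import java.util.Locale;

public class MessageDemo {
    // Fills {0} with the count and {1} with the file name. The pattern,
    // not the code, decides the order in which they appear.
    public static String render(String pattern, Locale locale, int count, String filename) {
        MessageFormat format = new MessageFormat(pattern, locale);
        return format.format(new Object[] {Integer.valueOf(count), filename});
    }

    public static void main(String[] args) {
        String english = "There were {0} spelling mistakes in file {1}";
        String german = "Datei {1} enth\u00e4lt {0} Rechtschreibfehler";
        System.out.println(render(english, Locale.ENGLISH, 3, "notes.txt"));
        System.out.println(render(german, Locale.GERMAN, 3, "notes.txt"));
    }
}
```

The calling code is identical for both languages; only the pattern changes, which is exactly what string concatenation cannot offer.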
Number Formatting
import java.text.NumberFormat;
import java.util.Locale;

public class Numbers {
    public static void main(String[] args) throws Exception {
        final double amount = 123456.789;
        Locale[] locales = NumberFormat.getAvailableLocales();
        System.out.println("<html><body><table>");
        for (int i = 0; i < locales.length; i++) {
            NumberFormat nf = NumberFormat.getInstance(locales[i]);
            NumberFormat cf = NumberFormat.getCurrencyInstance(locales[i]);
            NumberFormat pf = NumberFormat.getPercentInstance(locales[i]);
            System.out.println("<tr><td>" + locales[i] + "</td><td>"
                + nf.format(amount) + "</td><td>"
                + cf.format(amount) + "</td><td>"
                + pf.format(amount) + "</td></tr>");
        }
        System.out.println("</table></body></html>");
    }
}
Date Formatting
import java.text.DateFormat;
import java.util.Date;
import java.util.Locale;

public class Dates {
    public static void main(String[] args) throws Exception {
        final Date now = new Date();
        Locale[] locales = DateFormat.getAvailableLocales();
        System.out.println("<html><body><table border='1' cellspacing='0'>");
        for (int i = 0; i < locales.length; i++) {
            DateFormat shortFormat = DateFormat.getDateTimeInstance(
                DateFormat.SHORT, DateFormat.SHORT, locales[i]);
            DateFormat mediumFormat = DateFormat.getDateTimeInstance(
                DateFormat.MEDIUM, DateFormat.MEDIUM, locales[i]);
            DateFormat longFormat = DateFormat.getDateTimeInstance(
                DateFormat.LONG, DateFormat.LONG, locales[i]);
            System.out.println("<tr><td>" + locales[i] + "</td><td>"
                + shortFormat.format(now) + "</td><td>"
                + mediumFormat.format(now) + "</td><td>"
                + longFormat.format(now) + "</td></tr>");
        }
        System.out.println("</table></body></html>");
    }
}
String Comparison (for Searching and Sorting)
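Comparing strings by raw code unit values rarely matches what users expect from a sorted list. A minimal sketch of the difference, using `java.text.Collator` (the class `CollationDemo` and its helper methods are made up for illustration):

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class CollationDemo {
    // Sorts with the collation rules of the given locale.
    public static String[] sortFor(Locale locale, String... words) {
        String[] copy = words.clone();
        Arrays.sort(copy, Collator.getInstance(locale));
        return copy;
    }

    // Sorts with String.compareTo, which compares UTF-16 code units,
    // so every uppercase ASCII letter sorts before every lowercase one.
    public static String[] sortByCodeUnit(String... words) {
        String[] copy = words.clone();
        Arrays.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        String[] words = {"cherry", "Banana", "apple"};
        System.out.println(Arrays.toString(sortByCodeUnit(words)));
        System.out.println(Arrays.toString(sortFor(Locale.ENGLISH, words)));
    }
}
```

With the English collator, "apple" comes before "Banana"; with plain `compareTo` it comes after.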
Character Sets
Unicode
25 PERCENT SIGN
2B PLUS SIGN
54 LATIN CAPITAL LETTER T
5D RIGHT SQUARE BRACKET
B0 DEGREE SIGN
C9 LATIN CAPITAL LETTER E WITH ACUTE
2AD LATIN LETTER BIDENTAL PERCUSSIVE
39B GREEK CAPITAL LETTER LAMDA
446 CYRILLIC SMALL LETTER TSE
543 ARMENIAN CAPITAL LETTER CHEH
5E6 HEBREW LETTER TSADI
635 ARABIC LETTER SAD
784 THAANA LETTER BAA
94A DEVANAGARI VOWEL SIGN SHORT O
9D7 BENGALI AU LENGTH MARK
BEF TAMIL DIGIT NINE
D93 SINHALA LETTER AIYANNA
F0A TIBETAN MARK BKA- SHOG YIG MGO
11C7 HANGUL JONGSEONG NIEUN-SIOS
1293 ETHIOPIC SYLLABLE NAA
13CB CHEROKEE LETTER QUV
2023 TRIANGULAR BULLET
20A4 LIRA SIGN
2105 CARE OF
213A ROTATED CAPITAL Q
21B7 CLOCKWISE TOP SEMICIRCLE ARROW
2226 NOT PARALLEL TO
2234 THEREFORE
265E BLACK CHESS KNIGHT
1D111 MUSICAL SYMBOL FERMATA BELOW
1D122 MUSICAL SYMBOL F CLEF
One still needs to note the difference between text elements and characters:
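A combining sequence makes the distinction concrete: "e" followed by U+0301 COMBINING ACUTE ACCENT is two characters but one text element as far as a reader is concerned. The sketch below counts text elements with `java.text.BreakIterator`; the class `TextElements` and its method are made up for illustration.

```java
import java.text.BreakIterator;

public class TextElements {
    // Counts user-perceived characters by walking the character
    // boundaries reported by BreakIterator.
    public static int countTextElements(String text) {
        BreakIterator boundaries = BreakIterator.getCharacterInstance();
        boundaries.setText(text);
        int count = 0;
        while (boundaries.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String accented = "e\u0301"; // e + COMBINING ACUTE ACCENT
        System.out.println("chars: " + accented.length());
        System.out.println("text elements: " + countTextElements(accented));
    }
}
```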
Character Encoding Schemes
UTF-32, UTF-16, and UTF-8
UTF-32: four bytes per character. Real simple.
UTF-16: two or four bytes per character.
Character Range | Bit Encoding |
---|---|
U+0000 ... U+FFFF | xxxxxxxx xxxxxxxx |
U+10000 ... U+10FFFF | let y = x - 0x10000 in 110110yy yyyyyyyy 110111yy yyyyyyyy |
UTF-16 simply cannot encode codepoints beyond U+10FFFF. So far this is not a problem. Note also that the existence of UTF-16, and its blessing by the Unicode Consortium means that U+D800 through U+DFFF cannot be legal characters. Hack!?
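The table above translates almost directly into code. The following is a sketch of the encoding rule, written by hand for illustration (the class `Utf16Demo` and its `encode` method are made up); in real code `Character.toChars` does the same job.

```java
public class Utf16Demo {
    // Encodes one code point as one or two UTF-16 code units,
    // following the two rows of the table.
    public static char[] encode(int codePoint) {
        if (codePoint < 0 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not encodable in UTF-16");
        }
        if (codePoint >= 0xD800 && codePoint <= 0xDFFF) {
            throw new IllegalArgumentException("reserved for surrogates");
        }
        if (codePoint <= 0xFFFF) {
            return new char[] {(char) codePoint};
        }
        int y = codePoint - 0x10000;              // 20 bits remain
        char high = (char) (0xD800 | (y >> 10));  // 110110yy yyyyyyyy
        char low = (char) (0xDC00 | (y & 0x3FF)); // 110111yy yyyyyyyy
        return new char[] {high, low};
    }

    public static void main(String[] args) {
        // U+1D122 MUSICAL SYMBOL F CLEF needs a surrogate pair.
        for (char unit : encode(0x1D122)) {
            System.out.printf("%04X ", (int) unit);
        }
        System.out.println();
    }
}
```

For U+1D122 this yields the pair D834 DD22, the same result as `Character.toChars(0x1D122)`.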
UTF-8: one to six bytes per character (RFC 3629 now limits it to four, which covers everything up to U+10FFFF).
Character Range | Bit Encoding | (Bits) |
---|---|---|
U+0000 ... U+007F | 0xxxxxxx | 7 |
U+0080 ... U+07FF | 110xxxxx 10xxxxxx | 11 |
U+0800 ... U+FFFF | 1110xxxx 10xxxxxx 10xxxxxx | 16 |
U+10000 ... U+1FFFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | 21 |
U+200000 ... U+3FFFFFF | 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx | 26 |
U+4000000 ... U+7FFFFFFF | 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx | 31 |
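The first four rows of the table can be sketched as a hand-written encoder. This is an illustration only (the class `Utf8Demo` and its methods are made up, and it stops at U+10FFFF, so the five- and six-byte rows are not covered); production code would use `String.getBytes(StandardCharsets.UTF_8)`.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Demo {
    // Encodes one code point using the 1- to 4-byte rows of the table.
    // Surrogate code points are not rejected here, to keep the sketch short.
    public static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF) {
            throw new IllegalArgumentException("outside the Unicode range");
        }
        if (cp <= 0x7F) {
            return new byte[] {(byte) cp};
        }
        if (cp <= 0x7FF) {
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        if (cp <= 0xFFFF) {
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        return new byte[] {
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))};
    }

    // Renders bytes as space-separated uppercase hex, e.g. "C3 89".
    public static String hex(byte[] bytes) {
        StringBuilder out = new StringBuilder();
        for (byte b : bytes) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("%02X", b & 0xFF));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        int[] samples = {0x54, 0xC9, 0x20A4, 0x1D122};
        for (int cp : samples) {
            byte[] jdk = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            System.out.println("U+" + Integer.toHexString(cp).toUpperCase()
                + " -> " + hex(encode(cp)) + " (JDK: " + hex(jdk) + ")");
        }
    }
}
```

Note that ASCII text is unchanged by the encoding, and that a lead byte can never be mistaken for a continuation byte, since continuation bytes always start with the bits 10.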
UTF-8 rocks. The number of advantages it has is stunning. For example:
J2EE Web Tier and Struts I18n
<message-resources parameter="com.mycompany.myproject.main"/> and a convenient bean:message tag.
<p><bean:message key="visitor" arg0="${count}" /></p>
Perl I18n
use strict;
use warnings;
use charnames ':full';

my $s = "\x{0414}\x{043e}\x{0431}\x{0440}\x{043e}"
      . "\x{0435}\x{20}\x{0443}\x{0442}\x{0440}\x{043e}";
my $t = "\N{MUSIC FLAT SIGN}";
print "$s has length ", length $s, "\n";
{ use bytes; print "$s has length ", bytes::length $s, "\n"; }
my @a = unpack ("U*", $s);
print "@a\n";
@a = unpack ("C*", $s);
print "@a\n";
print "$t has length ", length $t, "\n";
{ use bytes; print "$t has length ", bytes::length $t, "\n"; }
XML I18n
<?xml version="1.0" encoding="utf-8" ?>
<doc xml:lang="en">
  <list title="Titre en français" xml:lang="fr">
    <p>Texte en français.</p>
    <p xml:lang="fr-ca">Texte en québécois.</p>
    <p xml:lang="en">Second text in English.</p>
  </list>
  <p>Text in English.</p>
</doc>
Some Internet Issues
Stuff We Didn't Cover
Summary