Character Encoding

How do we encode textual data?

Characters, Graphemes, and Glyphs

A character is a unit of textual information. A character has a name. Examples:

PLUS SIGN
CYRILLIC SMALL LETTER TSE
CHEROKEE LETTER TLO
BLACK CHESS KNIGHT
PITCHFORK
MUSICAL SYMBOL FERMATA BELOW

A grapheme is a minimally distinctive unit of writing in some writing system. It is what a person usually thinks of as a character. However, it may take more than one character to make up a grapheme. For example, the grapheme:

R̊

is made up of two characters (1) LATIN CAPITAL LETTER R and (2) COMBINING RING ABOVE. The grapheme:

நி

is made up of two characters (1) TAMIL LETTER NA and (2) TAMIL VOWEL SIGN I. This grapheme:

🚴🏾

is made up of two characters (1) BICYCLIST and (2) EMOJI MODIFIER FITZPATRICK TYPE-5. This grapheme:

🏄🏻‍♀

is made up of four characters (1) SURFER, (2) EMOJI MODIFIER FITZPATRICK TYPE-1-2, (3) ZERO WIDTH JOINER, (4) FEMALE SIGN. And this grapheme:

🇨🇻

requires two characters: (1) REGIONAL INDICATOR SYMBOL LETTER C and (2) REGIONAL INDICATOR SYMBOL LETTER V. It’s the flag for Cape Verde (CV).

Finally, a glyph is a picture of a character (or grapheme). Two or more characters can share the same glyph (e.g. LATIN CAPITAL LETTER A and GREEK CAPITAL LETTER ALPHA), and one character can have many glyphs (think fonts, e.g., A A A $\mathcal{A}$ $\mathscr{A}$).
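
The character make-up of a grapheme can be inspected programmatically. Here is a minimal Python sketch using the standard `unicodedata` module; the helper name `describe` is made up for this illustration, and the names returned depend on the Unicode version bundled with your Python.

```python
import unicodedata

def describe(grapheme):
    """Return 'U+XXXX NAME' for every character (code point) in the string."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in grapheme]

ring_r = "R\u030A"                    # R followed by a combining ring
cape_verde = "\U0001F1E8\U0001F1FB"   # regional indicators C and V

for g in (ring_r, cape_verde):
    # Each grapheme looks like one unit on screen but holds two characters.
    print(len(g), describe(g))
```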

Exercise: Define the terms character, grapheme, and glyph in your own words.

Exercise: The following three characters can be represented by the same glyph: LATIN SMALL LETTER O WITH STROKE, DIAMETER SIGN, EMPTY SET. Draw such a glyph.

Exercise: What two characters would you use to make the Spanish flag?

Character Sets

A character set has two parts: (1) a repertoire, which is a set of characters, and (2) a code position mapping, which is a function mapping non-negative integers to characters in the repertoire. When an integer $i$ maps to a character $c$ we say $i$ is the code point of $c$.

Unicode

An example character set is Unicode. Here is part of its code point mapping (note that code points are traditionally written in hex):

   25 PERCENT SIGN
   2C COMMA
   54 LATIN CAPITAL LETTER T
   5D RIGHT SQUARE BRACKET
   B0 DEGREE SIGN
   C9 LATIN CAPITAL LETTER E WITH ACUTE
  2AD LATIN LETTER BIDENTAL PERCUSSIVE
  39B GREEK CAPITAL LETTER LAMDA
  446 CYRILLIC SMALL LETTER TSE
  543 ARMENIAN CAPITAL LETTER CHEH
  5E6 HEBREW LETTER TSADI
  635 ARABIC LETTER SAD
  71D SYRIAC LETTER YUDH
  784 THAANA LETTER BAA
  94A DEVANAGARI VOWEL SIGN SHORT O
  9D7 BENGALI AU LENGTH MARK
  BEF TAMIL DIGIT NINE
  D93 SINHALA LETTER AIYANNA
  F0A TIBETAN MARK BKA- SHOG YIG MGO
 11C7 HANGUL JONGSEONG NIEUN-SIOS
 1293 ETHIOPIC SYLLABLE NAA
 13CB CHEROKEE LETTER QUV
 2023 TRIANGULAR BULLET
 20A4 LIRA SIGN
 20B4 HRYVNIA SIGN
 2105 CARE OF
 213A ROTATED CAPITAL Q
 21B7 CLOCKWISE TOP SEMICIRCLE ARROW
 2226 NOT PARALLEL TO
 2234 THEREFORE
 2248 ALMOST EQUAL TO
 265E BLACK CHESS KNIGHT
 30FE KATAKANA VOICED ITERATION MARK
 4A9D HAN CHARACTER LEATHER THONG WOUND AROUND THE HANDLE OF A SWORD
 7734 HAN CHARACTER DAZZLED
 99ED HAN CHARACTER TERRIFY, FRIGHTEN, SCARE, SHOCK
 AAB9 TAI VIET VOWEL UEA
1201F CUNEIFORM SIGN AK TIMES SHITA PLUS GISH
1D111 MUSICAL SYMBOL FERMATA BELOW
1D122 MUSICAL SYMBOL F CLEF
1F08E DOMINO TILE VERTICAL-06-01
1F991 SQUID
1F0CE PLAYING CARD KING OF DIAMONDS
1F382 BIRTHDAY CAKE
1F353 STRAWBERRY
1F4A9 PILE OF POO

Because characters can have multiple glyphs, Unicode lets you represent characters unambiguously with U+ followed by four to six hex digits (e.g. U+00C9, U+1D122).
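
Most languages let you move between characters and code points directly. A small Python sketch, where `u_plus` is a hypothetical helper written for this example:

```python
def u_plus(ch):
    """Format a single character's code point in U+ notation (at least 4 hex digits)."""
    return f"U+{ord(ch):04X}"

print(u_plus("\u00C9"))        # U+00C9
print(u_plus(chr(0x1D122)))    # U+1D122
print(chr(0x25))               # % (the character at code point 25 hex)
```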

How many characters are there?

How many characters are possible with Unicode? The powers that be have declared that the highest code point they will ever map is 10FFFF, so it seems like 1,114,112 code points are possible. However, they also declared they would never map characters to code points D800..DFFF, FDD0..FDEF, {0..F}FFFE, {0..F}FFFF, 10FFFE, 10FFFF (2114 code points). So the maximum number of possible characters is 1,111,998.
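
The arithmetic in the previous paragraph can be checked with a few lines of Python (the constant names are just for this sketch):

```python
TOTAL = 0x10FFFF + 1                    # code points 0..10FFFF
SURROGATES = 0xDFFF - 0xD800 + 1        # D800..DFFF
# FDD0..FDEF, plus the last two code points (xxFFFE, xxFFFF) of each of the 17 planes
NONCHARACTERS = (0xFDEF - 0xFDD0 + 1) + 2 * 17

print(TOTAL)                                # 1114112
print(SURROGATES + NONCHARACTERS)           # 2114
print(TOTAL - SURROGATES - NONCHARACTERS)   # 1111998
```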

How many characters have been assigned (so far)? See this cool table at Wikipedia to see how many characters have actually been mapped in each version of Unicode. (BTW, if you don’t peek, Unicode Version 15.1 has 149,813 characters mapped, so there is a lot of room to grow.)

Where can I find all of them?

Please see the complete and up-to-date code charts, with example glyphs of every character. If you would like a much easier way to browse the characters (and why wouldn’t you?), check out the beautiful charbase.com and codepoints.net. (Charbase and Codepoints are run by volunteers, so they may not be up to date.)

Blocks

The code points are not assigned haphazardly. They are allocated into blocks. There are around 300 blocks. Make sure to see the official blocks file at Unicode.org. If you need a little preview, here is a little slice of that file:

0000..007F     Basic Latin
0080..00FF     Latin-1 Supplement
0250..02AF     IPA Extensions
0400..04FF     Cyrillic
0F00..0FFF     Tibetan
2200..22FF     Mathematical Operators
2700..27BF     Dingbats
3100..312F     Bopomofo
10860..1087F   Palmyrene
12000..123FF   Cuneiform
1D100..1D1FF   Musical Symbols
1F000..1F02F   Mahjong Tiles
F0000..FFFFF   Supplementary Private Use Area-A
100000..10FFFF Supplementary Private Use Area-B

Categories

Every code point is assigned a general category, identified by a two-letter code:

Code	Description	Examples
Lu	Letter, Uppercase	U+0059 LATIN CAPITAL LETTER Y U+048C CYRILLIC CAPITAL LETTER SEMISOFT SIGN U+10C1 GEORGIAN CAPITAL LETTER HE U+1041C DESERET CAPITAL LETTER THEE
Ll	Letter, Lowercase	U+00EE LATIN SMALL LETTER I WITH CIRCUMFLEX U+03CE GREEK SMALL LETTER OMEGA WITH TONOS U+A755 LATIN SMALL LETTER P WITH SQUIRREL TAIL U+ABBC CHEROKEE SMALL LETTER WO
Lt	Letter, Titlecase	U+01F2 LATIN CAPITAL LETTER D WITH SMALL LETTER Z U+1FFC GREEK CAPITAL LETTER OMEGA WITH PROSGEGRAMMENI
Lm	Letter, Modifier	U+02C9 MODIFIER LETTER MACRON U+0971 DEVANAGARI SIGN HIGH SPACING DOT U+3031 VERTICAL KANA REPEAT MARK U+16F98 MIAO LETTER TONE-7
Lo	Letter, Other	U+01C2 LATIN LETTER ALVEOLAR CLICK U+06BD ARABIC LETTER NOON WITH THREE DOTS ABOVE U+2D85 ETHIOPIC SYLLABLE BOA U+A42E YI SYLLABLE JJYX
Nd	Number, Decimal Digit	U+0038 DIGIT EIGHT U+0B68 ORIYA DIGIT TWO U+1815 MONGOLIAN DIGIT FIVE U+1D7E0 MATHEMATICAL DOUBLE-STRUCK DIGIT EIGHT
Nl	Number, Letter	U+16EF RUNIC TVIMADUR SYMBOL U+2166 ROMAN NUMERAL SEVEN U+3028 HANGZHOU NUMERAL EIGHT U+10173 GREEK ACROPHONIC DELPHIC FIVE MNAS
No	Number, Other	U+00BD VULGAR FRACTION ONE HALF U+137A ETHIOPIC NUMBER NINETY U+2472 CIRCLED NUMBER NINETEEN U+10120 AEGEAN NUMBER EIGHT HUNDRED
Sm	Symbol, Math	U+00D7 MULTIPLICATION SIGN U+0608 ARABIC RAY U+221E INFINITY U+2978 GREATER-THAN ABOVE RIGHTWARDS ARROW
Sc	Symbol, Currency	U+00A5 YEN SIGN U+20A9 WON SIGN U+20AC EURO SIGN
Sk	Symbol, Modifier	U+00B4 ACUTE ACCENT U+02F9 MODIFIER LETTER BEGIN HIGH TONE U+FBC0 ARABIC SYMBOL SMALL TAH ABOVE
So	Symbol, Other	U+00A9 COPYRIGHT SIGN U+1391 ETHIOPIC TONAL MARK DERET U+230D BOTTOM LEFT CROP U+23E6 AC CURRENT U+2677 RECYCLING SYMBOL FOR TYPE-5 PLASTICS U+2827 BRAILLE PATTERN DOTS-1236
Pc	Punctuation, Connector	U+005F LOW LINE U+2040 CHARACTER TIE
Pd	Punctuation, Dash	U+002D HYPHEN-MINUS U+2013 EN DASH U+2014 EM DASH U+301C WAVE DASH
Ps	Punctuation, Open	U+0028 LEFT PARENTHESIS U+005B LEFT SQUARE BRACKET U+007B LEFT CURLY BRACKET U+29D8 LEFT WIGGLY FENCE
Pe	Punctuation, Close	U+0029 RIGHT PARENTHESIS U+005D RIGHT SQUARE BRACKET U+007D RIGHT CURLY BRACKET U+0F3D TIBETAN MARK ANG KHANG GYAS U+301B RIGHT WHITE SQUARE BRACKET
Pf	Punctuation, Final quote	U+2019 RIGHT SINGLE QUOTATION MARK U+201D RIGHT DOUBLE QUOTATION MARK
Pi	Punctuation, Initial quote	U+2018 LEFT SINGLE QUOTATION MARK U+201C LEFT DOUBLE QUOTATION MARK
Po	Punctuation, Other	U+0021 EXCLAMATION MARK U+002A ASTERISK U+104F MYANMAR SYMBOL GENITIVE U+203D INTERROBANG
Zl	Separator, Line	U+2028 LINE SEPARATOR
Zp	Separator, Paragraph	U+2029 PARAGRAPH SEPARATOR
Zs	Separator, Space	U+0020 SPACE U+00A0 NO-BREAK SPACE U+2009 THIN SPACE
Cc	Other, Control	U+0000 NULL U+0008 BACKSPACE U+000A LINE FEED (LF) U+0093 SET TRANSMIT STATE U+009C STRING TERMINATOR
Cf	Other, Format	U+00AD SOFT HYPHEN U+0604 ARABIC SIGN SAMVAT U+200B ZERO WIDTH SPACE U+200D ZERO WIDTH JOINER U+200F RIGHT-TO-LEFT MARK U+E007F CANCEL TAG
Co	Other, Private Use	This category is given to code points in the following (inclusive) ranges: E000..F8FF, F0000..FFFFD, and 100000..10FFFD.
Cs	Other, Surrogate	The 2048 code points in the range D800..DFFF (inclusive) are called surrogate code points and do not, and will not, map to any character.
Cn	Other, Not Assigned	Many code points are not yet assigned to a character, but 66 of the code points in this group are permanently and forever unassigned. These 66 are FDD0..FDEF (inclusive), FFFE, FFFF, 1FFFE, 1FFFF, 2FFFE, 2FFFF, 3FFFE, 3FFFF, 4FFFE, 4FFFF, 5FFFE, 5FFFF, 6FFFE, 6FFFF, 7FFFE, 7FFFF, 8FFFE, 8FFFF, 9FFFE, 9FFFF, AFFFE, AFFFF, BFFFE, BFFFF, CFFFE, CFFFF, DFFFE, DFFFF, EFFFE, EFFFF, FFFFE, FFFFF, 10FFFE, 10FFFF.
Mc	Mark, Spacing Combining	U+094C DEVANAGARI VOWEL SIGN AU U+0DF3 SINHALA VOWEL SIGN DIGA GAYANUKITTA U+16F7E MIAO VOWEL SIGN NG
Mn	Mark, Nonspacing	U+0301 COMBINING ACUTE ACCENT U+064B ARABIC FATHATAN U+0EB9 LAO VOWEL SIGN UU
Me	Mark, Enclosing	U+20DD COMBINING ENCLOSING CIRCLE U+A670 COMBINING CYRILLIC TEN MILLIONS SIGN
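
If you want to look up categories yourself, Python's standard `unicodedata.category` returns these two-letter codes. A short sketch (the `samples` table is just a handful of code points picked from the rows above; results for recently added characters depend on your Python's Unicode version):

```python
import unicodedata

# code point -> category we expect from the table above
samples = {
    0x0059: "Lu",   # LATIN CAPITAL LETTER Y
    0x01F2: "Lt",   # LATIN CAPITAL LETTER D WITH SMALL LETTER Z
    0x00BD: "No",   # VULGAR FRACTION ONE HALF
    0x20AC: "Sc",   # EURO SIGN
    0x200D: "Cf",   # ZERO WIDTH JOINER
    0x0301: "Mn",   # COMBINING ACUTE ACCENT
    0xE000: "Co",   # first private use code point
    0xFFFE: "Cn",   # one of the 66 permanently unassigned code points
}

for cp, expected in samples.items():
    print(f"U+{cp:04X}", unicodedata.category(chr(cp)), expected)
```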

Combining Characters, Modifiers, and Variation Selectors

It’s common to use multiple characters to make up graphemes. This is often done to put marks on letters, modify skin tones, modify genders, and create more complex graphemes. Examples:

ş̌́	U+0073 LATIN SMALL LETTER S U+0327 COMBINING CEDILLA U+030C COMBINING CARON U+0301 COMBINING ACUTE ACCENT
👶	U+1F476 BABY	baby
👶🏻	U+1F476 BABY U+1F3FB EMOJI MODIFIER FITZPATRICK TYPE-1-2	baby: light skin tone
👶🏼	U+1F476 BABY U+1F3FC EMOJI MODIFIER FITZPATRICK TYPE-3	baby: medium-light skin tone
👶🏽	U+1F476 BABY U+1F3FD EMOJI MODIFIER FITZPATRICK TYPE-4	baby: medium skin tone
👶🏾	U+1F476 BABY U+1F3FE EMOJI MODIFIER FITZPATRICK TYPE-5	baby: medium-dark skin tone
👶🏿	U+1F476 BABY U+1F3FF EMOJI MODIFIER FITZPATRICK TYPE-6	baby: dark skin tone
👩‍⚖️	U+1F469 WOMAN U+200D ZERO WIDTH JOINER U+2696 SCALES U+FE0F VARIATION SELECTOR-16	woman judge
👩🏽‍⚕️	U+1F469 WOMAN U+1F3FD EMOJI MODIFIER FITZPATRICK TYPE-4 U+200D ZERO WIDTH JOINER U+2695 STAFF OF AESCULAPIUS U+FE0F VARIATION SELECTOR-16	woman health worker: medium skin tone
👨‍🌾	U+1F468 MAN U+200D ZERO WIDTH JOINER U+1F33E EAR OF RICE	man farmer
👩🏿‍🔬	U+1F469 WOMAN U+1F3FF EMOJI MODIFIER FITZPATRICK TYPE-6 U+200D ZERO WIDTH JOINER U+1F52C MICROSCOPE	woman scientist: dark skin tone
🧜🏼‍♀	U+1F9DC MERPERSON U+1F3FC EMOJI MODIFIER FITZPATRICK TYPE-3 U+200D ZERO WIDTH JOINER U+2640 FEMALE SIGN	mermaid: medium-light skin tone
🧘‍♂	U+1F9D8 PERSON IN LOTUS POSITION U+200D ZERO WIDTH JOINER U+2642 MALE SIGN	man in lotus position
🤽🏾‍♀️	U+1F93D WATER POLO U+1F3FE EMOJI MODIFIER FITZPATRICK TYPE-5 U+200D ZERO WIDTH JOINER U+2640 FEMALE SIGN U+FE0F VARIATION SELECTOR-16	woman playing water polo: medium-dark skin tone
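
Sequences like these are ordinary strings of code points, so you can assemble them yourself. A minimal Python sketch (variable names are illustrative; whether the result renders as a single glyph depends on your fonts):

```python
import unicodedata

ZWJ = "\u200D"   # ZERO WIDTH JOINER

# WOMAN + skin tone modifier + ZWJ + MICROSCOPE = woman scientist: dark skin tone
woman_scientist = chr(0x1F469) + chr(0x1F3FF) + ZWJ + chr(0x1F52C)

print(len(woman_scientist))                          # 4 characters, one grapheme
print([f"U+{ord(ch):04X}" for ch in woman_scientist])
print(unicodedata.name(chr(0x1F3FF)))                # EMOJI MODIFIER FITZPATRICK TYPE-6
```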

Exercise: Research and describe the function of VARIATION SELECTOR-16. How is it different from VARIATION SELECTOR-15?

All of the Emojis

Unicode.org does provide a full list of all the emojis.

Check out the HTML version, though this is super slow to load. On the plus side, it has images.

Also, check out the plain text version. It loads quickly, but for emojis your browser cannot show, you don't see any images.

Funsies!

Awesome Codepoints is super cool. You need to check it out.

Normalization

Here are two character sequences:

U+0065 LATIN SMALL LETTER E, U+0300 COMBINING GRAVE ACCENT
U+00E8 LATIN SMALL LETTER E WITH GRAVE

In some sense they should represent the same text, right? In fact, they differ only in that the second sequence uses a pre-composed character. Since the character sequences are different, we can get them to compare equal only if we normalize them. Unicode defines four Normalization Forms.

Name	Stands for	Algorithm
NFD	Canonical Decomposition	Decompose by canonical equivalence, arranging combiners in specific order (gives fully expanded strings, so more space but fastest algorithm)
NFC	Canonical Composition	Decompose then recompose by canonical equivalence (takes longer but gives shorter strings)
NFKD	Compatibility Decomposition	Decompose by compatibility, arranging combiners in specific order
NFKC	Compatibility Composition	Decompose by compatibility then recompose by canonical equivalence

WHAT? Okay, take it slow. Canonical decomposition and composition do the right thing. Compatibility is lossy (one-way) but useful for search: It turns things like ligatures, roman numerals, subscripts, and superscripts into simpler forms. See why it’s great for search?

Example:

Input	NFD	NFC	NFKD	NFKC
U+00F1 LATIN SMALL LETTER N WITH TILDE U+0063 LATIN SMALL LETTER C U+0307 COMBINING DOT ABOVE U+0327 COMBINING CEDILLA U+2077 SUPERSCRIPT SEVEN U+FB01 LATIN SMALL LIGATURE FI U+2168 ROMAN NUMERAL NINE U+2468 CIRCLED DIGIT NINE	n ̃ c ̧ ̇ ⁷ ﬁ Ⅸ ⑨	ñ ç ̇ ⁷ ﬁ Ⅸ ⑨	n ̃ c ̧ ̇ 7 f i I X 9	ñ ç ̇ 7 f i I X 9
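
You can try the four forms yourself with Python's standard `unicodedata.normalize`. This sketch uses the two e-with-grave sequences and the input row from the example table; the variable names are just for illustration, and code points are printed in hex so the output is readable regardless of your fonts.

```python
import unicodedata

decomposed = "e\u0300"     # LATIN SMALL LETTER E + COMBINING GRAVE ACCENT
precomposed = "\u00E8"     # LATIN SMALL LETTER E WITH GRAVE

print(decomposed == precomposed)                                  # False
print(unicodedata.normalize("NFC", decomposed) == precomposed)    # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)    # True

# The input row from the table above
sample = "\u00F1c\u0307\u0327\u2077\uFB01\u2168\u2468"
forms = {form: unicodedata.normalize(form, sample)
         for form in ("NFD", "NFC", "NFKD", "NFKC")}
for form, result in forms.items():
    print(form, " ".join(f"{ord(ch):04X}" for ch in result))
```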

Other Character Sets

Unicode is really the only character set you should be working with. However, other character sets exist, and you should probably know something about them.

ISO8859-1

ISO8859-1 is a character set that is exactly equivalent to the first 256 mappings of Unicode. Obviously it doesn’t have enough characters.

ISO8859-2 through ISO8859-16

These charsets also have repertoires of at most 256 characters (there is no ISO8859-12, so there are 14 of them, not 15). They all share the same characters in the first 128 positions, but differ in the next 128. Details at http://www.unicode.org/Public/MAPPINGS/ISO8859/.

Windows-1252

This character set, also known as CP1252, has 256 code positions, a few of which (such as 9D) are unassigned. Its mapping can be found at http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT. It is very close to ISO8859-1. Be careful with this set! Users of Windows systems often unknowingly produce documents with this character set, then forget to specify it when making these documents available on the web or transporting them via other protocols which tend to default to Unicode. The end result is annoying. It’s best to avoid this.

ASCII

ASCII is a character set that is exactly equivalent to the first 128 mappings of Unicode. Obviously it doesn’t have enough characters. However it is commonly used, because many Internet protocols require it! It is a common subset of many character sets and something most people can agree on.

Character Encodings

A character encoding specifies how a character (or character string) is encoded in a bit string. Character sets with 256 characters or fewer have only a single encoding: they encode each character in a single byte (so in some sense the character set and the character encoding are about the same). But there are, naturally, many encodings of Unicode. The most important are UTF-32, UTF-16 and UTF-8.

UTF-32

This is the simplest. Just encode each character in 32 bits. The encoding of a character is simply its code point! Couldn’t be more straightforward. Of course, you try to convince people to actually use four bytes per character.

There are actually two kinds: UTF-32BE (Big Endian) and UTF-32LE (Little Endian). Examples:

Unicode Character	UTF-32BE	UTF-32LE
RIGHT SQUARE BRACKET (U+005D)	00 00 00 5D	5D 00 00 00
CHEROKEE LETTER QUV (U+13CB)	00 00 13 CB	CB 13 00 00
MUSICAL SYMBOL F CLEF (U+1D122)	00 01 D1 22	22 D1 01 00

Note that every character sequence can be encoded into a byte sequence, but not every byte sequence can be decoded into a character sequence. For example, the byte sequence CC CC CC CC does not decode to any character because there is no code point CCCCCCCC in Unicode.
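
Here is a quick way to reproduce the UTF-32 examples in Python, using its built-in `utf-32-be` and `utf-32-le` codecs. The helper `hex_bytes` is made up for this sketch.

```python
def hex_bytes(data):
    """Render bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)

f_clef = chr(0x1D122)   # MUSICAL SYMBOL F CLEF
print(hex_bytes(f_clef.encode("utf-32-be")))   # 00 01 D1 22
print(hex_bytes(f_clef.encode("utf-32-le")))   # 22 D1 01 00

# Not every byte sequence decodes: CCCCCCCC is not a code point
try:
    bytes.fromhex("CCCCCCCC").decode("utf-32-be")
    decodable = True
except UnicodeDecodeError:
    decodable = False
print(decodable)   # False
```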

UTF-32
The Good: It’s fixed width! Constant-time to find the nth character in a string (provided you care).
The Bad: It’s pretty bloated.

UTF-16

In UTF-16 some characters are encoded in 16 bits and some in 32 bits.

Character Range	Bit Encoding
U+0000 ... U+FFFF	xxxxxxxx xxxxxxxx
U+10000 ... U+10FFFF	let y = c-10000₁₆ (a 20-bit value, where c is the code point) in 110110yy yyyyyyyy 110111yy yyyyyyyy

Note that all characters requiring 32 bits have their first 16 bits in the range D800..DBFF and their second 16 bits in the range DC00..DFFF. Pretty slick, right? Those code points are never assigned to any character in Unicode.... How perfect, dontcha think?

There are actually two kinds: UTF-16BE and UTF-16LE. Examples:

Unicode Character	UTF-16BE	UTF-16LE
RIGHT SQUARE BRACKET (U+005D)	00 5D	5D 00
CHEROKEE LETTER QUV (U+13CB)	13 CB	CB 13
MUSICAL SYMBOL F CLEF (U+1D122)	D8 34 DD 22	34 D8 22 DD
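
The surrogate-pair rule is short enough to write out. This is a sketch, not a production encoder; `utf16_code_units` is a hypothetical helper name, and the result is checked against Python's built-in `utf-16-be` and `utf-16-le` codecs. Note that little-endian swaps the bytes within each 16-bit unit, not the order of the units.

```python
def utf16_code_units(cp):
    """Return the list of 16-bit code units that encode code point cp in UTF-16."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if cp <= 0xFFFF:
        return [cp]
    y = cp - 0x10000                       # 20-bit value
    return [0xD800 | (y >> 10),            # high surrogate: top 10 bits
            0xDC00 | (y & 0x3FF)]          # low surrogate: bottom 10 bits

print([f"{u:04X}" for u in utf16_code_units(0x1D122)])    # ['D834', 'DD22']
print(chr(0x1D122).encode("utf-16-be").hex())             # d834dd22
print(chr(0x1D122).encode("utf-16-le").hex())             # 34d822dd
```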

UTF-16
The Good: Nothing is good about this encoding.
The Bad: Variable width, and for real-world documents (which tend to be full of ASCII markup, digits, and whitespace) it often uses more space than UTF-8, even for East Asian scripts, people.
The Ugly: It’s what JavaScript thinks characters are. Facepalm.

UTF-8

Here’s another variable length encoding.

Character Range	Bit Encoding	Number of Bits
U+0000 ... U+007F	0xxxxxxx	7
U+0080 ... U+07FF	110xxxxx 10xxxxxx	11
U+0800 ... U+FFFF	1110xxxx 10xxxxxx 10xxxxxx	16
U+10000 ... U+1FFFFF	11110xxx 10xxxxxx 10xxxxxx 10xxxxxx	21
~~U+200000 ... U+3FFFFFF~~	~~111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx~~	26
~~U+4000000 ... U+7FFFFFFF~~	~~1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx~~	31

Examples:

Unicode Character	UTF-8 Encoding
RIGHT SQUARE BRACKET (U+005D)	5D
LATIN CAPITAL LETTER E WITH ACUTE (U+00C9)	C3 89
CHEROKEE LETTER QUV (U+13CB)	E1 8F 8B
MUSICAL SYMBOL F CLEF (U+1D122)	F0 9D 84 A2

And yeah, you never have to care about big-endian or little-endian with UTF-8.

UTF-8 absolutely rocks. The number of advantages it has is stunning. For example:

ASCII text is unchanged in UTF-8.
Non-ASCII characters are never coded with ASCII bytes (every byte of a multibyte encoding has its high bit set).
C programmers that have assumed that the octet 00 terminates a string won’t have to change their code (UTF-16 is full of these things!).
Western languages can represent text in about 1.1 bytes per character—big savings over the other UTFs.
When decoding you can always determine if you’ve ended up in the middle of a multibyte encoding (no character’s encoding starts with a byte of the form 10xxxxxx).
In a multibyte encoding, the number of leading ones in the first byte tells you how many bytes the character needs (a leading zero means one byte).
Encoding and decoding are done with shifts and logical bitmask operations, no division, so processing is very fast.
You can (lexicographically) sort by simply pretending each byte is a character.
The octets FE and FF never appear, so mass confusion about byte-order marks and end-of-file-markers improperly processed by lazy C programmers never occurs.
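
A couple of these properties are easy to see in code. The sketch below (names are illustrative) checks that sorting by raw UTF-8 bytes gives the same order as sorting by code point, and counts characters without decoding by skipping continuation bytes.

```python
words = ["zebra", "\u00E9clair", "\u4E2D\u6587", chr(0x1F353), "apple"]

by_code_points = sorted(words)                                   # Python compares str by code point
by_utf8_bytes = sorted(words, key=lambda w: w.encode("utf-8"))   # compare raw bytes
print(by_code_points == by_utf8_bytes)                           # True

def count_characters(data):
    """Count characters in UTF-8 data by ignoring continuation bytes (10xxxxxx)."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)

encoded = "\u00E9clair".encode("utf-8")
print(len(encoded), count_characters(encoded))                   # 7 6
print(0xFE in encoded or 0xFF in encoded)                        # False
```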

All character sequences can be encoded as UTF-8 byte sequences, but many byte sequences cannot be decoded as UTF-8. Given a 32-bit unsigned integer, you can use this pseudocode to get its (1–4) bytes:

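void to_utf_8(uint32 c) {
    if (c > 0x10FFFF) {
        throw new Error('Code point too large');
    } else if (c >= 0xD800 && c <= 0xDFFF) {
        // Surrogates are reserved for UTF-16 and have no UTF-8 encoding
        throw new Error('Surrogate code point');
    } else if (c <= 0x7F) {
        // 1 byte: 0xxxxxxx
        emit_byte(c);
    } else if (c <= 0x7FF) {
        // 2 bytes: 110xxxxx 10xxxxxx
        emit_byte(0xC0 | (c >> 6));
        emit_byte(0x80 | (c & 0x3F));
    } else if (c <= 0xFFFF) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        emit_byte(0xE0 | (c >> 12));
        emit_byte(0x80 | ((c >> 6) & 0x3F));
        emit_byte(0x80 | (c & 0x3F));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        emit_byte(0xF0 | (c >> 18));
        emit_byte(0x80 | ((c >> 12) & 0x3F));
        emit_byte(0x80 | ((c >> 6) & 0x3F));
        emit_byte(0x80 | (c & 0x3F));
    }
}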

Exercise: This admits too many characters. Modify this code to throw an exception for the 66 permanently unassigned code points as well.
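
If you would like something you can actually run, here is one way the pseudocode might look in Python. It is a sketch: the function name `to_utf8` is made up, it returns the bytes instead of emitting them, and it rejects surrogates (which have no UTF-8 encoding) but leaves the exercise above unsolved. The loop at the end compares it against Python's built-in encoder.

```python
def to_utf8(c):
    """Encode a single code point (an int) as UTF-8 bytes."""
    if c < 0 or c > 0x10FFFF:
        raise ValueError("Code point out of range")
    if 0xD800 <= c <= 0xDFFF:
        raise ValueError("Surrogate code points cannot be encoded")
    if c <= 0x7F:                                   # 0xxxxxxx
        return bytes([c])
    if c <= 0x7FF:                                  # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (c >> 6),
                      0x80 | (c & 0x3F)])
    if c <= 0xFFFF:                                 # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (c >> 12),
                      0x80 | ((c >> 6) & 0x3F),
                      0x80 | (c & 0x3F)])
    return bytes([0xF0 | (c >> 18),                 # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((c >> 12) & 0x3F),
                  0x80 | ((c >> 6) & 0x3F),
                  0x80 | (c & 0x3F)])

for cp in (0x5D, 0xC9, 0x13CB, 0x1D122):
    print(f"U+{cp:04X}", to_utf8(cp).hex(), to_utf8(cp) == chr(cp).encode("utf-8"))
```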

See also Tom Scott’s overview of UTF-8.

You should read the UTF-8 Everywhere Manifesto.

Exercise: For each of the following characters, give their encodings (where possible) in UTF-32BE, UTF-16BE, UTF-8, ISO8859-1 and Windows-1252:

COFFIN
BACKSPACE
MOON VIEWING CEREMONY
MEETEI MAYEK LETTER GOK
LATIN SMALL LETTER O WITH STROKE
CJK UNIFIED IDEOGRAPH-6CE0
END OF SELECTED AREA

Do these encodings by hand. The purpose of this practice is to develop understanding.

How Can You Determine The Encoding?

A huge and annoying mistake people make is handing someone text without saying what encoding it is in. For example, if you type in the text:

“Olé”™

into a text editor that encodes your text in UTF-8, then you have stored the data:

E2 80 9C 4F 6C C3 A9 E2 80 9D E2 84 A2

Now if you didn’t know the encoding and you just assumed the data was encoded in Windows-1252, you would decode the bytes as follows:

â€œOlÃ©â€�â„¢

Actually that was cheating a little bit: that � isn’t really correct; it’s a placeholder because there is no character in Windows-1252 for the code point 9D. Oh, well. Anyway, conversely, suppose you started with “Olé”™ and encoded this in Windows-1252. This gives the byte sequence:

93 4F 6C E9 94 99

If you tried to decode this as a UTF-8 character sequence you would get an error. Decoding as a UTF-16BE sequence you would get:

鍏泩钙
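
The whole mix-up above can be replayed in Python with its built-in `utf-8`, `cp1252`, and `utf-16-be` codecs. This is a sketch with illustrative variable names; `errors="replace"` is what produces the U+FFFD placeholder for the unassigned byte 9D. Results are printed with `ascii()` so they display on any terminal.

```python
text = "\u201COl\u00E9\u201D\u2122"      # the sample text, quotes and trademark sign included

utf8 = text.encode("utf-8")
print(utf8.hex())                                    # e2809c4f6cc3a9e2809de284a2

# Wrong guess #1: UTF-8 bytes read as Windows-1252
mojibake = utf8.decode("cp1252", errors="replace")
print(ascii(mojibake))

# Wrong guess #2: Windows-1252 bytes read as UTF-8 or UTF-16BE
cp1252 = text.encode("cp1252")
print(cp1252.hex())                                  # 934f6ce99499
try:
    cp1252.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)                                       # False
as_utf16 = cp1252.decode("utf-16-be")
print(ascii(as_utf16))
```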

IMPORTANT

When transmitting textual data, always specify the encoding.

But how?

Mutual agreement. You can just agree, implicitly, upon an encoding. Usually people agree on UTF-8. In fact, there is a manifesto that says everyone should just use UTF-8 all the time, everywhere.
In metadata, separate from the text. Many protocols separate metadata, or headers, from a payload. The metadata can contain the encoding to use for the payload. (But yeah, in this case you have to agree beforehand on how the metadata is encoded....)
In the data itself (woah). The character U+FEFF ZERO WIDTH NO-BREAK SPACE, alternately named BYTE ORDER MARK, or BOM, is allowed to appear at the start of a text stream. If it does, it can be used by the stream consumer to deduce the Unicode encoding. For example, if a stream starts with
- 00 00 FE FF then the stream is most likely UTF-32BE.
- FF FE 00 00 then the stream is most likely UTF-32LE.
- FE FF then the stream is most likely UTF-16BE.
- FF FE then the stream is most likely UTF-16LE.
- EF BB BF then the stream is most likely UTF-8.
Note: The Unicode Standard permits the BOM in UTF-8 streams, but does not require it, nor does it recommend it.
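
BOM detection along these lines might be sketched as follows. The function name `sniff_bom` is hypothetical; the `codecs.BOM_*` constants are part of Python's standard library. The UTF-32 marks must be tested before the UTF-16 ones, because FF FE 00 00 also begins with FF FE. A match is only a strong hint, and no match tells you nothing.

```python
import codecs

# Longest marks first, so UTF-32LE is not mistaken for UTF-16LE
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),   # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "utf-32-le"),   # FF FE 00 00
    (codecs.BOM_UTF8, "utf-8"),           # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),   # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),   # FF FE
]

def sniff_bom(data):
    """Return the encoding suggested by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None

print(sniff_bom(bytes.fromhex("FFFE0000")))      # utf-32-le
print(sniff_bom(bytes.fromhex("FFFE4100")))      # utf-16-le
print(sniff_bom(b"plain old bytes"))             # None
```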

Other Encodings

I won’t describe any others here, but UTF-7 is worth mentioning. If you like the stuff on this page see the IANA Charsets Page. You may also want to check out the UTF page at czyborra.com, which is very complete and well-written (and from which I borrowed the list of UTF-8 advantages).

Programming with Strings

How hard can this be? Do you know the answers to these two simple questions?

What is the length of a string? Oy. Maybe it is the number of graphemes? Or the number of characters? Or the number of bytes when encoded as UTF-8? Or the number of code units in a UTF-16 encoding? Does the BOM count at all? Oh and wait: do we count pre-composed characters as one character or multiple characters or.... AAAAAAAAHHHHHHHHHHHHH.
When are two strings equal? Yes there is the whole nastiness of whether we have reference equality (same string object) vs. value equality (same value), but what the heck does it mean to compare strings by value? Should we normalize first? If so, which normalization form should we use?

Well, it’s hard. But if you’ve read this far, you at least are now better able to debug problems.
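
To make the "length" question concrete, here is a small Python sketch that measures one grapheme (woman health worker: medium skin tone) three different ways. Python's `len` counts code points; other languages count other things.

```python
# WOMAN, skin tone modifier, ZERO WIDTH JOINER, STAFF OF AESCULAPIUS, VARIATION SELECTOR-16
s = "\U0001F469\U0001F3FD\u200D\u2695\uFE0F"

code_points = len(s)
utf8_bytes = len(s.encode("utf-8"))
utf16_units = len(s.encode("utf-16-le")) // 2   # each UTF-16 code unit is 2 bytes

print(code_points, utf8_bytes, utf16_units)     # 5 17 7

# Pre-composed vs. decomposed: same text on screen, different lengths
print(len("\u00E8"), len("e\u0300"))            # 1 2
```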

But wait, think about this a bit. Maybe the fact that these questions are hard means these are the wrong questions. Really. Consider:

Why do you care how long a string is? You should really only care about (1) how many bytes of storage are needed for it, or (2) how many pixels wide the rendered string is going to be. The number of characters is usually irrelevant!
Why are you comparing strings for equality anyway? Are you checking passwords? YOU BETTER BE HASHING THOSE, so then you will be comparing byte sequences! Are you looking up dictionary keys, or doing a search? In these cases, printable, normalized characters will be just fine. (Use normalization when building the search index and when parsing the query string.)
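
For the search case, one plausible approach is to reduce both the indexed text and the query to the same key before comparing. The sketch below uses NFKC plus `str.casefold()`; the name `search_key` and the choice of NFKC with case folding are assumptions for illustration, not a full implementation of Unicode caseless matching.

```python
import unicodedata

def search_key(s):
    """Normalize a string for loose, search-style comparison (illustrative only)."""
    return unicodedata.normalize("NFKC", s).casefold()

indexed = "\uFB01ancE\u0301"    # starts with the fi ligature; E + combining acute accent
query = "FIANC\u00C9"           # all caps, pre-composed E WITH ACUTE

print(indexed == query)                              # False
print(search_key(indexed) == search_key(query))      # True
```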

Aren’t these problems with strings widely known? Well, Edaqa Mortoray has written about this. He says programming languages don’t need a string type at all. And in fact, if your language does have one, it is probably badly broken. If you want to get a handle on all this, it helps to realize that strings and text are not the same.