UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode. Like UTF-16 and UTF-32, UTF-8 can represent every character in the Unicode character set, but unlike them it has the special property of being backwards-compatible with ASCII. For this reason, it is steadily becoming the dominant character encoding for files, e-mail, web pages, and software that manipulates textual information.
UTF-8 encodes each character (code point) in 1 to 4 octets (8-bit bytes). The first 128 characters of the Unicode character set, which correspond one-to-one with ASCII, use a single octet with the same binary value as in ASCII, so any valid ASCII text is also valid UTF-8.
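The ASCII compatibility described above can be checked with a minimal sketch that uses Python's built-in codecs. The sample string and variable names are illustrative assumptions, not part of any specification.

```python
# Pure-ASCII text produces identical bytes under the ASCII and UTF-8 codecs.
text = "Hello, UTF-8!"  # illustrative sample; any ASCII-only string works
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
same = ascii_bytes == utf8_bytes

# Every byte of the UTF-8 form is below 0x80, i.e. a single-octet character.
all_single_octets = all(b < 0x80 for b in utf8_bytes)

# A character outside the first 128 code points needs more than one octet.
cent_len = len("¢".encode("utf-8"))

print(same, all_single_octets, cent_len)
```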
The Internet Engineering Task Force (IETF) requires all Internet protocols to identify the encoding used for character data, and the supported character encodings must include UTF-8.
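The UTF-8 encoding is variable-width, with each character represented by 1 to 4 bytes. Each byte begins with zero to four consecutive '1' bits followed by a '0' bit, and this prefix indicates its type: no leading '1' bits mark a single-byte (ASCII) character, one leading '1' bit marks a continuation byte, and two, three, or four leading '1' bits mark the first byte of a sequence of that many bytes. The remaining bits of the bytes in a sequence are concatenated to get the Unicode code point.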
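As a rough illustration of how the leading bits identify a byte's role, the following sketch counts the leading '1' bits of a single byte. The function name `classify_byte` and its return labels are hypothetical, chosen only for this example, and the check looks at the prefix alone rather than performing full UTF-8 validation.

```python
def classify_byte(b: int) -> str:
    """Classify one byte by its count of leading 1 bits (prefix check only)."""
    if not 0 <= b <= 0xFF:
        raise ValueError("not a byte value")
    ones = 0
    mask = 0x80
    # Count consecutive 1 bits starting from the most significant bit.
    while mask and b & mask:
        ones += 1
        mask >>= 1
    if ones == 0:
        return "single"        # 0xxxxxxx: a one-byte (ASCII) character
    if ones == 1:
        return "continuation"  # 10xxxxxx: second or later byte of a sequence
    if ones <= 4:
        return f"lead-{ones}"  # 110..., 1110..., 11110...: first byte of a sequence
    return "invalid"           # five or more leading 1 bits are not used


# The three bytes of the euro sign: one lead byte, then two continuation bytes.
print([classify_byte(b) for b in "€".encode("utf-8")])
```

A decoder can use this kind of check to find character boundaries: any byte that is not a continuation byte starts a new character.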
Code point | Binary code point | UTF-8 bytes | Example |
---|---|---|---|
U+0000 to U+007F | 0xxxxxxx | 0xxxxxxx | '$' U+0024 = 00100100 → 00100100 → 0x24 |
U+0080 to U+07FF | 00000yyy yyxxxxxx | 110yyyyy 10xxxxxx | '¢' U+00A2 = 00000000 10100010 → 11000010 10100010 → 0xC2 0xA2 |
U+0800 to U+FFFF | zzzzyyyy yyxxxxxx | 1110zzzz 10yyyyyy 10xxxxxx | '€' U+20AC = 00100000 10101100 → 11100010 10000010 10101100 → 0xE2 0x82 0xAC |
U+010000 to U+10FFFF | 000wwwzz zzzzyyyy yyxxxxxx | 11110www 10zzzzzz 10yyyyyy 10xxxxxx | '𤭢' U+024B62 = 00000010 01001011 01100010 → 11110000 10100100 10101101 10100010 → 0xF0 0xA4 0xAD 0xA2 |
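The bit layout in the table can be sketched as a small encoder. This is a minimal illustration rather than a reference implementation: the function name `utf8_encode` is hypothetical, and the rejection of surrogate code points (U+D800 to U+DFFF) is an addition not discussed in the text above, included because they are not valid in UTF-8.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point following the table's bit patterns."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point outside U+0000..U+10FFFF")
    if 0xD800 <= cp <= 0xDFFF:
        # Assumption beyond the text above: surrogates are rejected.
        raise ValueError("surrogate code points cannot be encoded")
    if cp <= 0x7F:
        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 110yyyyy 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 1110zzzz 10yyyyyy 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110www 10zzzzzz 10yyyyyy 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


# Compare against Python's built-in codec for the four table examples.
for ch in "$¢€𤭢":
    encoded = utf8_encode(ord(ch))
    print(f"U+{ord(ch):04X}", encoded.hex(" "), encoded == ch.encode("utf-8"))
```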
So the first 128 characters (US-ASCII) need one byte. The next 1,920 characters need two bytes to encode. This includes Latin letters with diacritics and characters from Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac and Tāna alphabets. Three bytes are needed for the rest of the Basic Multilingual Plane (which contains virtually all characters in common use). Four bytes are needed for characters in the other planes of Unicode, which include less common CJK characters and various historic scripts.
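The byte counts per range can be expressed as a short lookup. In this sketch, `utf8_length` is a hypothetical helper and the sample characters are illustrative picks from some of the scripts mentioned above; the figure of 1,920 two-byte characters follows from the size of the range U+0080 to U+07FF.

```python
def utf8_length(cp: int) -> int:
    """Return how many bytes UTF-8 needs for a code point (range check only)."""
    if cp < 0:
        raise ValueError("negative code point")
    if cp <= 0x7F:
        return 1
    if cp <= 0x7FF:
        return 2
    if cp <= 0xFFFF:
        return 3
    if cp <= 0x10FFFF:
        return 4
    raise ValueError("code point above U+10FFFF")


# Number of code points in the two-byte range U+0080..U+07FF.
two_byte_count = 0x7FF - 0x80 + 1

# Illustrative samples: ASCII, Latin with diacritic, Greek, Cyrillic,
# Hebrew, a three-byte BMP character, and a supplementary-plane CJK character.
samples = ["A", "é", "Ω", "я", "א", "€", "𤭢"]
lengths = {ch: utf8_length(ord(ch)) for ch in samples}

print(two_byte_count)
print(lengths)
```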
By continuing the pattern given above, it is possible to encode much larger numbers. The original specification allowed for sequences of up to six bytes, covering numbers up to 31 bits (the original limit of the Universal Character Set). However, in November 2003, RFC 3629 restricted UTF-8 to the range covered by the formal Unicode definition, U+0000 to U+10FFFF. (Note that the IETF does not define UTF-8; Unicode does.)
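To see why six bytes reach 31 bits, the payload of an n-byte sequence can be computed from the pattern: the lead byte spends n '1' bits and one '0' bit on its prefix, and each continuation byte carries 6 bits. The sketch below assumes that pattern is simply extended to five and six bytes; the names `payload_bits` and `MAX_CODE_POINT` are hypothetical.

```python
def payload_bits(n: int) -> int:
    """Code point bits carried by an n-byte sequence under the extended pattern."""
    if not 1 <= n <= 6:
        raise ValueError("sequence length must be 1..6")
    if n == 1:
        return 7  # 0xxxxxxx
    lead_bits = 7 - n  # 8 bits minus n ones and the terminating zero
    return lead_bits + 6 * (n - 1)  # each continuation byte adds 6 bits


MAX_CODE_POINT = 0x10FFFF  # upper limit of the Unicode code space

for n in range(1, 7):
    print(n, payload_bits(n))

# The Unicode limit fits within the 21 bits of a four-byte sequence.
print(MAX_CODE_POINT.bit_length())
```

Because U+10FFFF needs only 21 bits, four bytes are enough for every Unicode code point, which is why the five- and six-byte forms became unnecessary once the range was restricted.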
This page uses Creative Commons licensed content from Wikipedia.