HTML & CSS Wiki

UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode. Like UTF-16 and UTF-32, UTF-8 can represent every character in the Unicode character set, but unlike them it has the special property of being backwards-compatible with ASCII. For this reason, it is steadily becoming the dominant character encoding for files, e-mail, web pages, and software that manipulates textual information.

UTF-8 encodes each character (code point) in 1 to 4 octets (8-bit bytes). The first 128 characters of the Unicode character set, which correspond one-to-one with ASCII, are encoded as a single octet with the same binary value as in ASCII.
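The ASCII compatibility can be checked with a minimal Python sketch that uses only the built-in `str.encode` method. The sample strings and variable names are illustrative assumptions, not part of any specification.

```python
# ASCII-only text encodes to identical bytes under ASCII and UTF-8.
ascii_text = "Hello, world!"
utf8_bytes = ascii_text.encode("utf-8")
ascii_bytes = ascii_text.encode("ascii")
same_bytes = utf8_bytes == ascii_bytes

# Each ASCII character occupies exactly one octet with the high bit clear.
all_single_octet = all(b < 0x80 for b in utf8_bytes)

# A character outside ASCII (here the cent sign, U+00A2) needs more than one octet.
cent_bytes = "\u00a2".encode("utf-8")

print(same_bytes, all_single_octet, cent_bytes)
```

Because the bytes are identical, software that only understands ASCII can still read the ASCII portion of a UTF-8 file unchanged.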

The Internet Engineering Task Force (IETF) requires all Internet protocols to identify the encoding used for character data, and the supported character encodings must include UTF-8.


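UTF-8 is a variable-width encoding: each character is represented by 1 to 4 bytes. The high-order bits of each byte indicate its type: 0 to 4 consecutive '1' bits followed by a '0' bit. A byte with no leading '1' bits is a complete single-byte character, a byte with exactly one is a continuation byte, and a byte with two, three or four is the first byte of a sequence of that many bytes. The remaining bits of the bytes in a sequence are concatenated to give the Unicode code point.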
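The leading-bit rule can be sketched as a small byte classifier. The function names `leading_ones` and `byte_kind` are hypothetical, and the sketch inspects only the bit pattern: it does not reject lead bytes such as 0xC0 or 0xF5 that a full validator would refuse for other reasons.

```python
def leading_ones(byte: int) -> int:
    """Count consecutive 1 bits starting from the most significant bit."""
    count = 0
    mask = 0x80
    while mask and byte & mask:
        count += 1
        mask >>= 1
    return count


def byte_kind(byte: int) -> str:
    """Classify a byte by its leading-bit pattern (illustrative labels)."""
    n = leading_ones(byte)
    if n == 0:
        return "single"          # 0xxxxxxx: a complete one-byte character
    if n == 1:
        return "continuation"    # 10xxxxxx: second or later byte of a sequence
    if n <= 4:
        return f"lead-{n}"       # 110xxxxx, 1110xxxx, 11110xxx: starts n bytes
    return "invalid"             # five or more leading 1 bits


# The bytes of the euro sign: one lead byte, then two continuation bytes.
kinds = [byte_kind(b) for b in b"\xe2\x82\xac"]
print(kinds)
```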

| Code point range | Binary code point | UTF-8 bytes | Example |
| --- | --- | --- | --- |
| U+0000 to U+007F | 0xxxxxxx | 0xxxxxxx | '$' U+0024 = 00100100; UTF-8: 00100100 (0x24) |
| U+0080 to U+07FF | 00000yyy yyxxxxxx | 110yyyyy 10xxxxxx | '¢' U+00A2 = 00000000 10100010; UTF-8: 11000010 10100010 (0xC2 0xA2) |
| U+0800 to U+FFFF | zzzzyyyy yyxxxxxx | 1110zzzz 10yyyyyy 10xxxxxx | '€' U+20AC = 00100000 10101100; UTF-8: 11100010 10000010 10101100 (0xE2 0x82 0xAC) |
| U+010000 to U+10FFFF | 000wwwzz zzzzyyyy yyxxxxxx | 11110www 10zzzzzz 10yyyyyy 10xxxxxx | '𤭢' U+024B62 = 00000010 01001011 01100010; UTF-8: 11110000 10100100 10101101 10100010 (0xF0 0xA4 0xAD 0xA2) |
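The bit layouts in the table above translate fairly directly into shifts and masks. The following is a minimal sketch, not a reference implementation: the function name `encode_code_point` is hypothetical, and in practice Python's built-in `str.encode("utf-8")` does this work.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8, following the table row by row.

    Sketch only: it does not reject the surrogate code points
    U+D800..U+DFFF, which are not valid in UTF-8.
    """
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError(f"code point out of range: {cp:#x}")
    if cp <= 0x7F:
        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 110yyyyy 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 1110zzzz 10yyyyyy 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110www 10zzzzzz 10yyyyyy 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


# The four examples from the table.
for cp in (0x24, 0xA2, 0x20AC, 0x24B62):
    print(f"U+{cp:04X}", encode_code_point(cp).hex(" "))
```

Each continuation byte carries six payload bits, which is why every shift is a multiple of 6 and every mask is 0x3F.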

Thus the first 128 characters (US-ASCII) need one byte. The next 1,920 characters need two bytes to encode. These include Latin letters with diacritics and characters from the Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac and Tāna alphabets. Three bytes are needed for the rest of the Basic Multilingual Plane (which contains virtually all characters in common use). Four bytes are needed for characters in the other planes of Unicode, which include less common CJK characters and various historic scripts.
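The byte counts above follow from the range boundaries alone. Here is a small sketch under that assumption; the helper `utf8_length` and the sample labels are hypothetical, chosen only for illustration.

```python
def utf8_length(cp: int) -> int:
    """Return how many bytes UTF-8 uses for a code point in U+0000..U+10FFFF."""
    if cp <= 0x7F:
        return 1
    if cp <= 0x7FF:
        return 2
    if cp <= 0xFFFF:
        return 3
    return 4


# Sample code points from several scripts (labels are informal).
samples = {
    "Latin A": 0x0041,
    "Latin e with acute": 0x00E9,
    "Cyrillic Zhe": 0x0416,
    "Hebrew Alef": 0x05D0,
    "Euro sign": 0x20AC,
    "CJK U+4E2D": 0x4E2D,
    "CJK U+24B62": 0x24B62,
}
lengths = {name: utf8_length(cp) for name, cp in samples.items()}

# Number of code points in the two-byte range U+0080..U+07FF.
two_byte_count = 0x7FF - 0x80 + 1

print(lengths, two_byte_count)
```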

By continuing the pattern given above, it is possible to encode much larger numbers. The original specification allowed sequences of up to six bytes, covering numbers up to 31 bits (the original limit of the Universal Character Set). In November 2003, however, RFC 3629 restricted UTF-8 to the range covered by the formal Unicode definition, U+0000 to U+10FFFF. (Note that UTF-8 itself is defined by Unicode, not by the IETF.)
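The effect of the restriction can be observed with Python's built-in decoder, which rejects sequences beyond U+10FFFF. This is a sketch of that behaviour; the helper name `is_valid_utf8` and the sample byte strings are assumptions made for illustration.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Report whether Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The highest permitted code point, U+10FFFF, encodes to four bytes.
highest = chr(0x10FFFF).encode("utf-8")

# A four-byte pattern that would decode to U+110000, just past the limit.
beyond_limit = bytes([0xF4, 0x90, 0x80, 0x80])

# A five-byte sequence in the style of the original, longer forms.
five_byte_form = bytes([0xF8, 0x88, 0x80, 0x80, 0x80])

print(highest.hex(" "))
print(is_valid_utf8(highest), is_valid_utf8(beyond_limit), is_valid_utf8(five_byte_form))
```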


This page uses Creative Commons Licensed content from Wikipedia (view authors).