UTF-8 Guide, Meaning, Facts, Information and Description
UTF-8 (8-bit Unicode Transformation Format) is a lossless, variable-length character encoding for Unicode created by Rob Pike and Ken Thompson. It uses groups of bytes to represent the Unicode standard for the alphabets of many of the world's languages. UTF-8 is especially useful for transmission over 8-bit mail systems. It uses 1 to 4 bytes per character, depending on the Unicode symbol. For example, only one UTF-8 byte is needed to encode each of the 128 US-ASCII characters in the Unicode range U+0000 to U+007F.
While it may seem inefficient to represent Unicode characters with as many as 4 bytes, UTF-8 allows legacy systems to transmit this ASCII superset. Additionally, data compression can still be performed independently of the use of UTF-8.
The IETF requires all Internet protocols to identify the encoding used for character data with UTF-8 as at least one supported encoding.
In summary, a Unicode character's bits are divided into several groups, which are then divided among the lower bit positions inside the UTF-8 bytes.
Characters smaller than 128 (decimal) are encoded with a single byte that contains their value: these correspond exactly to the 128 7-bit ASCII characters.
In other cases, up to 4 bytes are required. The uppermost bit of each of these bytes is 1, to prevent confusion with 7-bit ASCII characters (particularly the characters lower than 32 decimal, traditionally called control characters, e.g. carriage return).
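The high-bit rule can be checked with a short sketch. The helper name `has_ascii_byte` is hypothetical, and the check leans on Python's built-in UTF-8 encoder rather than on anything specific to this article.

```python
def has_ascii_byte(ch: str) -> bool:
    """Return True if any byte of ch's UTF-8 encoding is below 0x80."""
    return any(b < 0x80 for b in ch.encode("utf-8"))


# Non-ASCII characters: every byte of their encoding has the top bit set,
# so none of them can be mistaken for a control character such as CR.
for ch in ["é", "א", "あ", "😀"]:
    print(repr(ch), has_ascii_byte(ch))

# ASCII characters, including control characters, stay as a single byte.
print("\r".encode("utf-8") == b"\r")
```

Because of this, a byte-oriented tool that looks for a carriage return or a slash never finds one inside the encoding of another character.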
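For example, the character alef (א), which is Unicode 0x05D0, is encoded into UTF-8 in this way: it falls in the range 0x0080 to 0x07FF, so it is encoded with two bytes of the form 110xxxxx 10xxxxxx. Hexadecimal 0x05D0 is binary 101 1101 0000. These 11 bits are placed in order into the positions marked x, giving 11010111 10010000. The final result is the two bytes 0xD7 0x90.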
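The alef example can be reproduced with a few bit operations. This is a minimal sketch; the function name `encode_two_byte` is made up for illustration and handles only the two-byte range.

```python
def encode_two_byte(code_point: int) -> bytes:
    """Pack an 11-bit code point into the pattern 110xxxxx 10xxxxxx."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("the two-byte form covers 0x0080 to 0x07FF only")
    lead = 0b11000000 | (code_point >> 6)           # upper 5 bits
    trail = 0b10000000 | (code_point & 0b00111111)  # lower 6 bits
    return bytes([lead, trail])


alef = encode_two_byte(0x05D0)
print(alef.hex(" "))                  # d7 90
print(alef == "א".encode("utf-8"))    # True
```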
Description
UTF-8 is currently standardized as RFC 3629 (UTF-8, a transformation format of ISO 10646).
| Code range (hexadecimal) | UTF-16 | UTF-8 (binary) | Notes |
| 000000 - 00007F | 00000000 0xxxxxxx | 0xxxxxxx | ASCII equivalence range; byte begins with zero |
| 000080 - 0007FF | 00000xxx xxxxxxxx | 110xxxxx 10xxxxxx | first byte begins with 110 or 1110, the following byte(s) begin with 10 |
| 000800 - 00FFFF | xxxxxxxx xxxxxxxx | 1110xxxx 10xxxxxx 10xxxxxx | first byte begins with 110 or 1110, the following byte(s) begin with 10 |
| 010000 - 10FFFF | 110110xx xxxxxxxx 110111xx xxxxxxxx | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | UTF-16 requires surrogates; an offset of 0x10000 is subtracted, so the bit pattern is not identical with UTF-8 |
So the first 128 characters need one byte. The next 1920 characters need two bytes to encode. This includes Latin alphabet characters with diacritics, Greek, Cyrillic, Coptic, Armenian, Hebrew, and Arabic characters. The rest of the UCS-2 characters use three bytes, and additional characters are encoded in 4 bytes. (An earlier UTF-8 specification allowed even higher code points to be represented, using 5 or 6 bytes, but this is no longer supported.)
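The four code ranges translate directly into an encoder. The following is a sketch under stated assumptions rather than a reference implementation: the name `utf8_encode` is hypothetical, and the rejection of the surrogate range 0xD800 to 0xDFFF follows RFC 3629 rather than anything stated in the text above.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point using the 1- to 4-byte patterns."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point outside 0x00-0x10FFFF")
    if 0xD800 <= cp <= 0xDFFF:
        # Assumption taken from RFC 3629: surrogates are not encodable.
        raise ValueError("surrogate code points are not valid in UTF-8")
    if cp <= 0x7F:        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:      # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


print(utf8_encode(0x41).hex(" "))    # 41
print(utf8_encode(0x20AC).hex(" "))  # e2 82 ac
```

Comparing the result with Python's own `chr(cp).encode("utf-8")` at the boundary values of each range is a quick way to check a sketch like this.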
Originally, UTF-8 could use sequences of up to six bytes and cover the whole range 0x00-0x7FFFFFFF (31 bits), but in November 2003 RFC 3629 restricted it to the range covered by the formal Unicode definition, 0x00-0x10FFFF. Before this, only the bytes 0xFE and 0xFF never occurred in UTF-8 encoded text. With the limit in place, the number of byte values that never occur in a UTF-8 stream rose to 13: 0xC0, 0xC1, and 0xF5-0xFF. Although the new definition limits the available encoding range severely, it helps with the problem of overlong sequences (different ways of encoding the same character, which can be a security risk): a two-byte overlong sequence would have to begin with 0xC0 or 0xC1 and is therefore never valid. Longer overlong forms, such as a three-byte sequence beginning 0xE0 0x80, contain none of the unused bytes, so a decoder must still check for and reject them, as RFC 3629 requires.
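A strict decoder makes these restrictions visible. The sketch below relies on Python's built-in UTF-8 decoder, which rejects invalid input with `UnicodeDecodeError`; the helper `is_valid_utf8` and the set name `NEVER_VALID` are illustrative only.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes under Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Byte values that cannot occur anywhere in UTF-8 as restricted by RFC 3629.
NEVER_VALID = {0xC0, 0xC1} | set(range(0xF5, 0x100))
print(len(NEVER_VALID))                    # 13

print(is_valid_utf8(b"/"))                 # True: the ordinary one-byte form
print(is_valid_utf8(b"\xc0\xaf"))          # False: overlong two-byte '/'
print(is_valid_utf8(b"\xe0\x80\xaf"))      # False: overlong three-byte '/'
print(is_valid_utf8(b"\xf5\x80\x80\x80"))  # False: lead byte beyond 0x10FFFF
```

The three-byte case contains none of the 13 unused byte values, yet a conforming decoder still refuses it.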
Modified UTF-8
The Java programming language uses an encoding that is a non-standard modification of UTF-8. This encoding is known among Java users as Modified UTF-8.
There are two differences between modified and standard UTF-8. The first difference is that the null character (\u0000) is encoded with two bytes instead of one, specifically as 11000000 10000000. This ensures that there are no embedded nulls in the encoded string, perhaps to address the concern that if the encoded string is processed in a language such as C where a null byte signifies the end of a string, an embedded null would cause the string to be truncated.
The second difference is in how characters outside the BMP are encoded. In standard UTF-8, these characters are encoded using the 4-byte format above. In modified UTF-8, these characters are first represented as surrogate pairs (as in UTF-16), and then the surrogate pairs are encoded individually in sequence. The reason for this modification is more subtle. In Java, a character is 16 bits; therefore some Unicode characters require two Java characters to represent. This aspect of the language predates the supplementary planes of Unicode; however, it is important for performance as well as backwards compatibility, and is unlikely to change. The modified encoding ensures that an encoded string can be decoded one Java character at a time, rather than one Unicode character at a time. Unfortunately, this also means that characters requiring 4 bytes in UTF-8 require 6 bytes in modified UTF-8.
For a complete specification of the Modified UTF-8 format, see [1]
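The two differences can be illustrated in a few lines. This is a rough sketch written in Python for illustration, not Java's own implementation; the names `_encode_unit` and `modified_utf8_encode` are hypothetical.

```python
def _encode_unit(unit: int) -> bytes:
    """Encode one 16-bit unit with the 1-, 2- or 3-byte UTF-8 patterns."""
    if unit == 0:
        return b"\xc0\x80"  # difference 1: null takes two bytes
    if unit <= 0x7F:
        return bytes([unit])
    if unit <= 0x7FF:
        return bytes([0xC0 | (unit >> 6), 0x80 | (unit & 0x3F)])
    return bytes([0xE0 | (unit >> 12),
                  0x80 | ((unit >> 6) & 0x3F),
                  0x80 | (unit & 0x3F)])


def modified_utf8_encode(text: str) -> bytes:
    """Encode text one 16-bit unit at a time, as described above."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            # Difference 2: split into a surrogate pair as in UTF-16,
            # then encode each surrogate separately (3 bytes each).
            offset = cp - 0x10000
            high = 0xD800 | (offset >> 10)
            low = 0xDC00 | (offset & 0x3FF)
            out += _encode_unit(high) + _encode_unit(low)
        else:
            out += _encode_unit(cp)
    return bytes(out)


print(modified_utf8_encode("\x00").hex(" "))        # c0 80
print(len("\U0001F600".encode("utf-8")))            # 4
print(len(modified_utf8_encode("\U0001F600")))      # 6
```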
UTF-8 was first officially presented at the USENIX conference in San Diego, January 25-29, 1993.
Rationale behind UTF-8's mechanics
As a consequence of the exact mechanics of UTF-8, the following properties of multi-byte sequences hold: the most significant bit of a single-byte character is always 0; the most significant bits of the first byte of a multi-byte sequence determine the length of the sequence (110 for two-byte sequences, 1110 for three-byte sequences, etc.); and the remaining bytes in a multi-byte sequence have 10 as their two most significant bits.
UTF-8 was designed to satisfy these properties in order to guarantee that no byte sequence of one character is contained within a longer byte sequence of another character. This ensures that byte-wise sub-string matching can be applied to search for words or phrases within a text; some older variable-length 8-bit encodings (such as Shift-JIS) did not have this property and thus made string-matching algorithms rather complicated. Although it is argued that this property adds redundancy to UTF-8-encoded text, the advantages outweigh this concern; besides, data compression is not one of Unicode's aims and must be considered independently.
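A small sketch can make these properties concrete: classify each byte by its leading bits, then confirm that a plain byte-wise search lands on a character boundary. The helper names `byte_kind` and `char_starts` are hypothetical.

```python
def byte_kind(b: int) -> str:
    """Classify a byte by its most significant bits."""
    if (b & 0b10000000) == 0:
        return "single"        # 0xxxxxxx
    if (b & 0b11000000) == 0b10000000:
        return "continuation"  # 10xxxxxx
    return "lead"              # 110xxxxx, 1110xxxx, 11110xxx


def char_starts(data: bytes) -> list:
    """Byte offsets at which a character begins in well-formed UTF-8."""
    return [i for i, b in enumerate(data) if byte_kind(b) != "continuation"]


text = "naïve א café"
data = text.encode("utf-8")
needle = "café".encode("utf-8")

offset = data.find(needle)             # ordinary byte-wise search
print(offset in char_starts(data))     # True: the match starts on a boundary
print(len(data[:offset].decode("utf-8")) == text.find("café"))  # True
```

Since a continuation byte can never be mistaken for the start of a character, the search needs no knowledge of the encoding beyond treating the text as bytes.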
Advantages
Disadvantages
History
UTF-8 was invented by Ken Thompson on September 2, 1992, on a placemat in a New Jersey diner with Rob Pike. The day after, Pike and Thompson implemented it and updated their Plan 9 operating system to use it throughout.
External links
