ASCII, Unicode and UTF-8: finally, an article that explains them thoroughly

Foreword

I usually like to write and read blogs, but I have always been a little confused about character encoding. While browsing the Internet this afternoon, I suddenly wanted to understand it properly.

1. ASCII

We know that inside a computer, all information is ultimately stored as binary values. Each binary digit (bit) has two states, 0 and 1, so eight bits, which together are called a byte, can be combined into 256 states. That is, one byte can represent 256 different states, from 00000000 to 11111111, and each state can correspond to one symbol.
In the 1960s, the United States formulated a character encoding that standardized the relationship between English characters and binary values. It is called ASCII and is still in use today.

ASCII specifies encodings for 128 characters in total. For example, the space character is 32 (binary 00100000) and the uppercase letter A is 65 (binary 01000001). These 128 symbols (including 33 non-printing control characters, codes 0–31 and 127) occupy only the low 7 bits of a byte; the highest bit is always 0.
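The two values above are easy to check. Here is a minimal Python sketch (the helper name ascii_bits is made up for this illustration) that prints a character's code and its 8-bit binary form:

```python
def ascii_bits(ch):
    """Return the 8-bit binary string for a single ASCII character."""
    code = ord(ch)                      # numeric code of the character
    if code > 127:
        raise ValueError("not an ASCII character")
    return format(code, "08b")          # zero-padded to a full byte


print(ord(" "), ascii_bits(" "))   # 32 00100000
print(ord("A"), ascii_bits("A"))   # 65 01000001
```

The leading bit of every result is 0, because all ASCII codes fit in 7 bits.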

2. Non-ASCII encoding

128 symbols are enough to encode English, but not other languages. In French, for example, letters can carry diacritical marks, which ASCII cannot represent. So some European countries decided to use the unused highest bit of the byte to encode new symbols. For example, é in French is encoded as 130 (binary 10000010). In this way, the encodings used in these European countries can represent up to 256 symbols.

However, this creates a new problem. Different countries have different letters, so even though they all use a 256-symbol encoding, the same value stands for different letters. For example, 130 represents é in the French encoding, the letter Gimel (ג) in the Hebrew encoding, and yet another symbol in the Russian encoding. In all of these encodings, though, the symbols represented by 0–127 are the same; only the range 128–255 differs.
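To see this in practice, the following Python sketch decodes the single byte 130 under three legacy code pages. The choice of cp437, cp862 and cp866 is an assumption made for this illustration: they are codecs shipped with Python that happen to match the French, Hebrew and Russian examples above, not necessarily the exact encodings the text had in mind.

```python
# The same byte value decoded under three legacy 8-bit code pages.
# cp437 / cp862 / cp866 are stand-ins chosen for this illustration.
raw = bytes([130])
decoded = {name: raw.decode(name) for name in ("cp437", "cp862", "cp866")}

for name, ch in decoded.items():
    print(name, ch, f"U+{ord(ch):04X}")

# Bytes 0-127 mean the same thing in all three code pages.
same_low_half = all(bytes([65]).decode(name) == "A" for name in decoded)
print(same_low_half)
```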

Asian languages use even more symbols; there are as many as 100,000 Chinese characters. One byte can represent only 256 symbols, which is far from enough, so multiple bytes must be used for one symbol. For example, the common encoding for Simplified Chinese is GB2312, which uses two bytes per Chinese character, so in theory it can represent up to 256 x 256 = 65536 symbols.

Chinese encoding deserves an article of its own and is not covered in this note. It is only pointed out here that, although they also use multiple bytes per symbol, the GB family of Chinese character encodings has nothing to do with the Unicode and UTF-8 discussed below.

3. Unicode

As mentioned in the previous section, there are many encodings in the world, and the same binary value can be interpreted as different symbols. Therefore, to open a text file you must know its encoding; interpreting it with the wrong encoding produces garbled text. Why do emails often appear garbled? Because the sender and the recipient use different encodings.
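The garbling is easy to reproduce. In this small Python sketch, the sample word and the wrong encoding (Latin-1) are arbitrary choices made for illustration: the "sender" writes UTF-8 bytes and the "recipient" decodes them with a different encoding.

```python
text = "caf\u00e9"                    # "café"
data = text.encode("utf-8")           # sender writes UTF-8 bytes
wrong = data.decode("latin-1")        # recipient wrongly assumes Latin-1
right = data.decode("utf-8")          # recipient uses the matching encoding

print(wrong)   # the é shows up as two unrelated characters
print(right)   # café
```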

Imagine an encoding that included every symbol in the world, with each symbol given a unique code: the garbled-text problem would disappear. This is Unicode; as its name implies, it is an encoding of all symbols.

Unicode is, of course, a very large set, now sized to hold more than a million symbols. Each symbol has its own code. For example, U+0639 is the Arabic letter Ain, U+0041 is the English capital letter A, and U+4E25 is the Chinese character 严. For the full correspondence tables, consult unicode.org or a dedicated table of Chinese characters.
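These three code points can be looked up with Python's standard unicodedata module. This is only a quick sketch; the variable names are made up for the example.

```python
import unicodedata

# The three examples from the text: Arabic Ain, Latin A, and 严.
samples = ["\u0639", "A", "\u4e25"]
names = {f"U+{ord(ch):04X}": unicodedata.name(ch) for ch in samples}

for code, name in names.items():
    print(code, name)
```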

4. The problem with Unicode

It should be noted that Unicode is just a symbol set: it specifies only the binary code of each symbol, not how that code should be stored.

For example, the Unicode code of the Chinese character 严 is hexadecimal 4E25, which is 15 bits in binary (100111000100101), so representing this symbol takes at least 2 bytes. Symbols with larger codes may take 3 or 4 bytes, or even more.

There are two serious problems here. The first is how to distinguish Unicode from ASCII: how does the computer know that three bytes represent one symbol rather than three separate symbols? The second is that English letters need only one byte. If Unicode required every symbol to be stored in three or four bytes, each English letter would have to be padded with two or three bytes of zeros. That is a huge waste of storage: text files would become two or three times larger, which is unacceptable.

The results were: 1) Unicode has multiple storage formats, meaning many different binary formats can be used to represent it. 2) Unicode could not be widely adopted for a long time, until the Internet came along.

5. UTF-8

The spread of the Internet created strong demand for a unified encoding. UTF-8 is the most widely used implementation of Unicode on the Internet. Other implementations include UTF-16 (two or four bytes per character) and UTF-32 (four bytes per character), but they are rarely used on the Internet. To repeat: UTF-8 is one of the implementations of Unicode.
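To make the size differences concrete, this Python sketch counts the bytes each implementation needs for a few sample strings. The helper name encoded_sizes is invented for this example, and the "-le" codec variants are used only so that no byte order mark is added to the count.

```python
def encoded_sizes(text):
    """Return the byte length of text under UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return {enc: len(text.encode(enc))
            for enc in ("utf-8", "utf-16-le", "utf-32-le")}


for sample in ("A", "\u4e25", "\U0001F600", "Hello, world"):
    print(repr(sample), encoded_sizes(sample))
```

For mostly English text UTF-8 is the most compact of the three, which is the storage argument made in the previous section.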

One of the biggest features of UTF-8 is that it is a variable-length encoding. It uses 1 to 4 bytes per symbol, with the length depending on the symbol.
The encoding rules of UTF-8 are simple; there are only two:

1) For a single-byte symbol, the first bit of the byte is set to 0, and the last 7 bits are the Unicode code of the symbol. So for English letters, UTF-8 encoding and ASCII code are the same.

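2) For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1, bit n+1 is set to 0, and the first two bits of each following byte are set to 10. All remaining bits hold the Unicode code of the symbol.
The following table summarizes the encoding rules; the letter x marks the bits available for encoding.

Unicode code range (hex)  | UTF-8 encoding (binary)
--------------------------+------------------------------------
0000 0000 – 0000 007F     | 0xxxxxxx
0000 0080 – 0000 07FF     | 110xxxxx 10xxxxxx
0000 0800 – 0000 FFFF     | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000 – 0010 FFFF     | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx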

According to the above table, interpreting UTF-8 encoding is very simple. If the first bit of a byte is 0, the byte is a character by itself; if the first bit is 1, how many consecutive 1s there are means how many bytes the current character occupies.
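The rule just described (count the leading 1 bits of the first byte) can be sketched in Python as follows. The function name utf8_sequence_length is hypothetical, and the sketch only inspects the leading byte; it does not validate the continuation bytes.

```python
def utf8_sequence_length(first_byte):
    """Return how many bytes the character starting with first_byte occupies."""
    if first_byte & 0b10000000 == 0:
        return 1                      # 0xxxxxxx: a character by itself
    ones = 0
    mask = 0b10000000
    while first_byte & mask:          # count the consecutive leading 1 bits
        ones += 1
        mask >>= 1
    if ones == 1 or ones > 4:
        # 10xxxxxx is a continuation byte; 5 or more leading 1s are not used
        raise ValueError("not a valid UTF-8 leading byte")
    return ones


data = "A\u4e25".encode("utf-8")
print(utf8_sequence_length(data[0]), utf8_sequence_length(data[1]))  # 1 3
```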

Next, take the Chinese character 严 as an example to show how UTF-8 encoding works.

The Unicode code of 严 is 4E25 (100111000100101). According to the table above, 4E25 falls in the range of the third row (0000 0800 – 0000 FFFF), so the UTF-8 encoding of 严 takes three bytes, in the format 1110xxxx 10xxxxxx 10xxxxxx. Then, starting from the last binary bit of 严, fill in the x positions from back to front, and pad any remaining positions with 0. The resulting UTF-8 encoding of 严 is 11100100 10111000 10100101, which is E4B8A5 in hexadecimal.
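The same fill-in procedure can be written out by hand with bit operations. The following is a minimal Python sketch, not a production encoder: the function name utf8_encode is made up, and it does not reject surrogate code points as a real encoder would.

```python
def utf8_encode(code_point):
    """Encode one Unicode code point to UTF-8 using the bit patterns above."""
    if code_point < 0x80:                       # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:                      # 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (code_point >> 6),
                      0b10000000 | (code_point & 0b111111)])
    if code_point < 0x10000:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (code_point >> 12),
                      0b10000000 | ((code_point >> 6) & 0b111111),
                      0b10000000 | (code_point & 0b111111)])
    if code_point <= 0x10FFFF:                  # 11110xxx and three 10xxxxxx
        return bytes([0b11110000 | (code_point >> 18),
                      0b10000000 | ((code_point >> 12) & 0b111111),
                      0b10000000 | ((code_point >> 6) & 0b111111),
                      0b10000000 | (code_point & 0b111111)])
    raise ValueError("code point out of range")


print(utf8_encode(0x4E25).hex().upper())   # E4B8A5
```

Each shift selects the next group of six bits from the code point, and the OR adds the fixed prefix bits for that byte.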

6. Conversion between Unicode and UTF-8

The example in the previous section shows that the Unicode code of 严 is 4E25 and its UTF-8 encoding is E4B8A5. The two are different, and a program can convert between them.
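As one way to do this conversion in a program, here is a short Python sketch built on the language's own UTF-8 codec. The two helper names are invented for this example, and the second helper assumes the hex string encodes exactly one character.

```python
def code_point_to_utf8_hex(code_point):
    """Unicode code point (int) -> UTF-8 bytes as an uppercase hex string."""
    return chr(code_point).encode("utf-8").hex().upper()


def utf8_hex_to_code_point(hex_string):
    """UTF-8 hex string for a single character -> code point as hex."""
    ch = bytes.fromhex(hex_string).decode("utf-8")
    return f"{ord(ch):04X}"          # ord() fails if ch is not one character


print(code_point_to_utf8_hex(0x4E25))      # E4B8A5
print(utf8_hex_to_code_point("E4B8A5"))    # 4E25
```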

On Windows, the easiest way to convert is to use the built-in Notepad program, notepad.exe. After opening the file, choose Save As from the File menu; the dialog box that appears has an Encoding drop-down list at the bottom.

There are four options: ANSI, Unicode, Unicode big endian and UTF-8.
1) ANSI is the default encoding. It is ASCII for English files and GB2312 for Simplified Chinese files (only on the Simplified Chinese version of Windows; the Traditional Chinese version uses Big5).
2) Unicode here refers to the UCS-2 encoding used by notepad.exe, in which the Unicode code of each character is stored directly in two bytes. This option uses little endian format.
3) Unicode big endian corresponds to the previous option, but in big endian format. Little endian and big endian are explained in the next section.
4) UTF-8 is the encoding described in the previous section.

After selecting the “encoding method”, click the “Save” button, and the encoding method of the file will be converted immediately.

7. Little endian and big endian

As mentioned in the previous section, the UCS-2 format can store Unicode codes directly (for code points not exceeding 0xFFFF). Take the Chinese character 严 as an example: its Unicode code is 4E25, which needs two bytes, one byte being 4E and the other 25. Storing 4E first and 25 second is the big endian method; storing 25 first and 4E second is the little endian method.
These two odd names come from “Gulliver’s Travels” by the British author Swift. In the book, a civil war breaks out in Lilliput over whether eggs should be cracked at the big end or the little end. Six wars were fought over this; one emperor lost his life and another lost his throne.
Putting the first byte first is “big endian”; putting the second byte first is “little endian”.
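The two byte orders for 4E25 can be shown directly in Python. This is a small sketch using int.to_bytes; the variable names are chosen for the example.

```python
code_point = 0x4E25                          # the character 严

big = code_point.to_bytes(2, "big")          # 4E first, then 25
little = code_point.to_bytes(2, "little")    # 25 first, then 4E

print(big.hex().upper(), little.hex().upper())   # 4E25 254E
```

Python's "utf-16-be" and "utf-16-le" codecs produce the same two byte sequences for this character.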
So naturally, a question arises: how does the computer know which way a certain file is encoded?
The Unicode specification defines a character, placed at the very beginning of a file, that indicates the byte order. This character is called “zero width no-break space” and is represented by FEFF. That is exactly two bytes, and FF is 1 greater than FE.
If the first two bytes of a text file are FE FF, the file is big endian; if they are FF FE, it is little endian.
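A rough Python sketch of this check follows. The function name detect_utf16_byte_order is hypothetical; the BOM constants come from the standard codecs module. It only looks at the first two bytes, so it cannot tell a UTF-16 little endian file from a UTF-32 little endian one, whose mark also begins with FF FE.

```python
import codecs


def detect_utf16_byte_order(data):
    """Return 'big', 'little', or None based on the first two bytes of data."""
    if data.startswith(codecs.BOM_UTF16_BE):     # FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):     # FF FE
        return "little"
    return None                                  # no byte order mark found


print(detect_utf16_byte_order(bytes.fromhex("FEFF4E25")))   # big
print(detect_utf16_byte_order(bytes.fromhex("FFFE254E")))   # little
```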

8. Examples

Here is an example.
Open Notepad (notepad.exe), create a new text file containing the single character 严, and save it in turn with the ANSI, Unicode, Unicode big endian and UTF-8 encodings.
Then use the hexadecimal view in the text editor UltraEdit to inspect the bytes of each file.
1) ANSI: the file is two bytes, D1 CF, which is the GB2312 encoding of 严. This also implies that GB2312 is stored big endian.
2) Unicode: the file is four bytes, FF FE 25 4E. FF FE indicates little endian storage, and the actual code is 4E25.
3) Unicode big endian: the file is four bytes, FE FF 4E 25. FE FF indicates big endian storage.
4) UTF-8: the file is six bytes, EF BB BF E4 B8 A5. The first three bytes, EF BB BF, indicate that this is UTF-8; the last three, E4 B8 A5, are the encoding of 严, stored in the same order as the encoding itself.
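The four byte sequences can also be reproduced without Notepad. In this Python sketch, the mapping from Notepad's option names to Python codecs is an assumption made for illustration (ANSI is taken to mean GB2312, as on Simplified Chinese Windows).

```python
import codecs

ch = "\u4e25"  # 严

# Assumed mapping of Notepad's four options to Python codecs.
saved = {
    "ANSI": ch.encode("gb2312"),
    "Unicode": codecs.BOM_UTF16_LE + ch.encode("utf-16-le"),
    "Unicode big endian": codecs.BOM_UTF16_BE + ch.encode("utf-16-be"),
    "UTF-8": ch.encode("utf-8-sig"),     # utf-8-sig prepends EF BB BF
}

for label, data in saved.items():
    print(label + ":", " ".join(f"{b:02X}" for b in data))
```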

9. References

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets

Talk about Unicode encoding

10. A small question (how to determine whether a string is ASCII or Unicode?)
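One possible answer, as a rough Python sketch: the article does not say which language or definition it has in mind, so this assumes "is ASCII" means that every byte (or every character code) is below 128. The helper name is_ascii_bytes is made up; str.isascii is a built-in method available since Python 3.7.

```python
def is_ascii_bytes(data):
    """True if every byte has its highest bit clear, i.e. is in 0-127."""
    return all(b < 0x80 for b in data)


print(is_ascii_bytes(b"hello"))                    # True
print(is_ascii_bytes("\u4e25".encode("utf-8")))    # False

# For text already decoded into a str, Python 3.7+ offers str.isascii().
print("hello".isascii(), "\u4e25".isascii())       # True False
```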

11. Supplement

While reading about Python today, I came across an article on encoding and am recording it here. The basic story of how these encodings came about is already covered above; the following adds some further explanation.

Now, take a look at the difference between ASCII encoding and Unicode encoding: ASCII encoding is 1 byte, while Unicode encoding is usually 2 bytes.

The letter A is encoded in ASCII as decimal 65, binary 01000001;

The character 0 is encoded in ASCII as decimal 48, binary 00110000. Note that the character '0' and the integer 0 are different;

You can guess that to write the ASCII-encoded A in Unicode, you only need to pad zeros in front. Therefore, the Unicode encoding of A is 00000000 01000001.

A new problem arises: unifying on Unicode makes garbled text disappear, but if your text is almost entirely English, Unicode needs twice as much storage as ASCII, which is very uneconomical for storage and transmission.

Therefore, in the spirit of economy, UTF-8 appeared: a “variable-length” encoding of Unicode. UTF-8 encodes a Unicode character into 1 to 4 bytes depending on the size of its code (the original design allowed up to 6 bytes, but the current standard, RFC 3629, limits it to 4): common English letters take 1 byte, Chinese characters usually take 3 bytes, and only rarely used characters take 4 bytes. If the text you want to transmit contains a lot of English characters, encoding it in UTF-8 saves space:

Character   ASCII       Unicode              UTF-8
A           01000001    00000000 01000001    01000001
中          x           01001110 00101101    11100100 10111000 10101101
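The rows of this table can be regenerated with a few lines of Python. In this sketch the helper names bits and table_row are invented, and the "Unicode" column is assumed to mean the two-byte big endian form, which is what the table shows.

```python
def bits(data):
    """Format bytes as space-separated 8-bit binary groups."""
    return " ".join(format(b, "08b") for b in data)


def table_row(ch):
    """Return the (ASCII, Unicode, UTF-8) columns for one character."""
    try:
        ascii_col = bits(ch.encode("ascii"))
    except UnicodeEncodeError:
        ascii_col = "x"                  # not representable in ASCII
    # Assumption: the "Unicode" column is the two-byte big endian form.
    return ascii_col, bits(ch.encode("utf-16-be")), bits(ch.encode("utf-8"))


for ch in ("A", "\u4e2d"):               # A and 中
    print(ch, *table_row(ch), sep=" | ")
```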

The table above also shows an extra benefit of UTF-8: ASCII can be regarded as a subset of UTF-8, so a large amount of legacy software that only supports ASCII can continue to work under UTF-8.
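This compatibility is simple to confirm. In the following Python sketch (the sample text is arbitrary), pure ASCII text produces byte-for-byte identical output under both encodings, so ASCII bytes can be read back as UTF-8 unchanged.

```python
text = "plain ASCII text"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

print(ascii_bytes == utf8_bytes)             # True
print(ascii_bytes.decode("utf-8") == text)   # True
```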

After figuring out the relationship between ASCII, Unicode and UTF-8, we can summarize the working methods of character encoding commonly used in computer systems:

In the computer memory, Unicode encoding is used uniformly, and when it needs to be saved to the hard disk or needs to be transmitted, it is converted to UTF-8 encoding.

When editing with Notepad, the UTF-8 characters read from the file are converted to Unicode characters in memory. When you save after editing, the Unicode is converted back to UTF-8 and written to the file.
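The same read, edit and save cycle looks like this in Python, where str plays the role of the in-memory Unicode text. This is a minimal sketch: the file name note.txt and the helper append_text are made up for the example, and a temporary directory is used so nothing is left behind.

```python
import os
import tempfile


def append_text(path, extra):
    """Read a UTF-8 file into memory, edit it, and save it back as UTF-8."""
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()              # UTF-8 bytes on disk -> str in memory
    text += extra                    # the "editing" step
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)                # str in memory -> UTF-8 bytes on disk
    return text


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "note.txt")     # hypothetical file name
    with open(path, "wb") as f:
        f.write("\u4e25".encode("utf-8"))    # file starts out containing 严
    result = append_text(path, "A")
    with open(path, "rb") as f:
        on_disk = f.read()

print(result, on_disk.hex().upper())         # 严A E4B8A541
```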

When browsing the web, the server converts the dynamically generated Unicode content to UTF-8 and transmits it to the browser.

So you will see something like <meta charset="UTF-8" /> in the source of many web pages, indicating that the page is encoded in UTF-8.

Reference article: Senior article
