[Python basics] day4 – character encoding and decoding

The development of character encoding:

Binary
-->ASCII: stores only English letters, digits, and basic symbols (128 characters); one character occupies one byte (8 bits, of which ASCII uses 7)
---->GB2312: stores more than 6,700 Chinese characters, 1980
------>GBK 1.0: stores more than 20,000 characters, 1995
-------->GB18030: stores about 27,000 Chinese characters, 2000
---------->Unicode (universal code), UTF-32 form: every character occupies 4 bytes.
------------>Unicode, UTF-16 form: a character occupies 2 bytes if it is one of the 65,536 code points of the Basic Multilingual Plane, and 4 bytes otherwise.
-------------->Unicode, UTF-8 form: variable length, 1 to 4 bytes per character; an English letter is stored as its one-byte ASCII code, and a Chinese character usually occupies 3 bytes.
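
The byte counts listed above can be checked directly in Python 3. The following is a minimal sketch; the sample characters and the variable names (`samples`, `sizes`, `emoji`) are chosen here for illustration only. The `-le` codec variants are used so that no byte order mark is added to the result.

```python
# Encoded size of one ASCII letter and one Chinese character in several encodings.
samples = ["a", "中"]
encodings = ["utf-8", "utf-16-le", "utf-32-le", "gbk"]

sizes = {}
for ch in samples:
    for enc in encodings:
        sizes[(ch, enc)] = len(ch.encode(enc))
        print(f"{ch!r} in {enc}: {sizes[(ch, enc)]} byte(s)")

# A character outside the Basic Multilingual Plane (illustrative choice: an emoji)
# needs 4 bytes in UTF-8 and 4 bytes (a surrogate pair) in UTF-16.
emoji = "\U0001F600"
print(len(emoji.encode("utf-8")), len(emoji.encode("utf-16-le")))
```

Note that the plain `utf-16` and `utf-32` codecs prepend a byte order mark when encoding, so their results are 2 and 4 bytes longer than the figures above.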

  • encode: convert Unicode text into bytes
  • decode: convert bytes back into Unicode text

Application example of encoding and decoding:

  • Different countries use different languages and, historically, different character encodings. Suppose we make a game in China whose subtitles are all stored in a Chinese encoding such as GBK. If the game is later launched in Japan or other countries, the text appears garbled on systems that assume a different encoding. An early workaround was to ship a separate Japanese or Korean language package with the game, but with many target countries the number of packages becomes unmanageable. Encoding and decoding through Unicode solves this. Just as English serves as a common language between people from different countries, Unicode serves as the common character set: the Chinese text is decoded into Unicode, and a system in Japan can then encode that Unicode text into whatever encoding it uses locally.
  • That is to say, there is no direct conversion between a Chinese encoding and a Japanese encoding; the conversion always goes through Unicode, by decoding and then encoding.
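
As a rough sketch of this idea, the snippet below converts text from GBK to Shift_JIS by going through Unicode. The sample string and the choice of Shift_JIS as the "Japanese" encoding are assumptions made for illustration; the variable names are hypothetical.

```python
# Sample text whose characters exist in both GBK and Shift_JIS.
text = "日本"

gbk_bytes = text.encode("gbk")               # bytes as a Chinese system might store them
as_unicode = gbk_bytes.decode("gbk")         # step 1: decode GBK bytes into Unicode (str)
sjis_bytes = as_unicode.encode("shift_jis")  # step 2: encode Unicode into Shift_JIS bytes

print(gbk_bytes)
print(sjis_bytes)

# Skipping the Unicode step and reading GBK bytes as if they were Shift_JIS
# produces garbled text instead of the original characters.
garbled = gbk_bytes.decode("shift_jis", errors="replace")
print(garbled)
```

The two byte strings differ even though they represent the same characters, which is why bytes written under one encoding cannot simply be handed to a program that expects the other.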

In Python 2, unicode objects are stored in memory as UCS-2 (2 bytes per character, similar to UTF-16) or UCS-4, depending on how the interpreter was built.

Viewing the default encoding in Python 3:

import sys
print(sys.getdefaultencoding())

result:

utf-8

Process finished with exit code 0
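
One practical consequence of this default, sketched below under the assumption of a standard Python 3 interpreter: calling `encode()` or `decode()` without an argument uses UTF-8. The sample string and variable names are illustrative only.

```python
import sys

default = sys.getdefaultencoding()

s = "中文 abc"
implicit = s.encode()          # no codec given: UTF-8 is used
explicit = s.encode("utf-8")   # the same call with the codec spelled out

print(default)
print(implicit == explicit)
print(implicit.decode() == s)  # decode() without an argument also uses UTF-8
```

Naming the codec explicitly is still a reasonable habit, because it makes the intended encoding visible to whoever reads the code.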

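Encoding in Python 3:

import sys
print(sys.getdefaultencoding())  # default encoding used by str.encode() and bytes.decode()

s = '特斯拉'  # "Tesla" in Chinese; a str holds Unicode text
s_to_gbk = s.encode("gbk")  # str -> bytes, using the GBK encoding
print(s)
print(s_to_gbk)

result:

utf-8
特斯拉
b'\xcc\xd8\xcb\xb9\xc0\xad'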

Process finished with exit code 0

  • Encoding converts a string (str, which holds Unicode text) into the bytes type
  • Decoding does the reverse: it converts the bytes type back into a string (Unicode)
  • The b prefix marks the bytes type; each byte is an integer in the range [0-255]
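
The three points above can be sketched as a round trip. The variable names are illustrative, and the string is the Chinese spelling of "Tesla" that corresponds to the GBK bytes shown in the result earlier.

```python
s = "特斯拉"

data = s.encode("gbk")        # str -> bytes
print(type(data).__name__)    # the result is a bytes object
print(list(data))             # each element is an integer between 0 and 255

restored = data.decode("gbk")  # bytes -> str, using the same codec
print(restored == s)

# Decoding with a different codec than the one used for encoding can fail.
try:
    data.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError as exc:
    decode_failed = True
    print("not valid UTF-8:", exc.reason)
```

Decoding only recovers the original text when it uses the same codec that produced the bytes; here the GBK bytes are not a valid UTF-8 sequence, so the second decode raises an error.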
