str.encode() gives unexpected results

I've been playing with Python's built-in functions and got some results that confused me.

Look at this code:

>>> 'ü'.encode()
b'\xc3\xbc'

Why does it return `\xc3\xbc` (195 and 188 in decimal)? If you look at an ASCII table, ü is the 129th character. Or, if you look here, ü is the 252nd Unicode character, which is what you get from:

>>> ord('ü')
252

So where does `\xc3\xbc` come from, and why is it split into two bytes? And when you decode with `b'\xc3\xbc'.decode()`, how does it know that those two bytes are one character?

Answer

In the table you are looking at, the section titled "extended ASCII" is more commonly known as ISO/IEC 8859, or latin1. ASCII, as a character set, defines 7-bit characters from 0 to 127. latin1 is an extension of ASCII that defines another 128 single-byte characters. Python uses UTF-8 by default, which extends ASCII (and is therefore compatible with it) but is not compatible with latin1.
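That compatibility claim is easy to check directly (a small sketch; the codec names are the standard ones Python ships with):

```python
# ASCII-range text (code points 0-127) encodes to the same bytes
# under ASCII, latin-1, and UTF-8 alike:
assert 'abc'.encode('ascii') == 'abc'.encode('latin-1') == 'abc'.encode('utf-8') == b'abc'

# Beyond 127 the two extensions diverge: latin-1 stays at one byte
# per character, while UTF-8 switches to multi-byte sequences.
assert 'ü'.encode('latin-1') == b'\xfc'
assert 'ü'.encode('utf-8') == b'\xc3\xbc'
```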

The character ü has Unicode code point 0xFC (252 in decimal), and when using UTF-8 it is encoded with two bytes.
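The bitwise arithmetic is simple enough to sketch by hand. This is an illustration of the standard two-byte UTF-8 pattern (110xxxxx 10xxxxxx), not how CPython's codec is actually implemented:

```python
cp = ord('ü')                            # 252, i.e. 0xFC = 0b11111100

# Two-byte UTF-8 pattern: 110xxxxx 10xxxxxx
first = 0b11000000 | (cp >> 6)           # high bits  -> 0xC3 (195)
second = 0b10000000 | (cp & 0b00111111)  # low 6 bits -> 0xBC (188)

assert bytes([first, second]) == 'ü'.encode()  # b'\xc3\xbc'
```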

Many online ASCII tables get this wrong. Calling code points 128 through 255 "ASCII characters" is inaccurate, because ASCII does not claim to assign any values to those code points.

  • Yes, the byte string consists of two bytes: 195 and 188 (0xC3 and 0xBC in hexadecimal). How that relates to the code point 252 is really just a bit of bitwise arithmetic, and you can read all about it on [Wikipedia](https://en.wikipedia.org/wiki/UTF-8).
  • UTF-8 efficiently encodes ASCII characters (0-127) as single bytes. Last I checked, Unicode defines code points up to 0x10FFFF, so it clearly cannot encode every character uniquely in one byte. Three bytes per character would work, but would be inefficient. The most common characters are encoded with one byte, but some byte values have to be reserved to indicate that additional bytes follow. A single marker like 254 meaning "2 more bytes follow" and 255 meaning "3 more bytes follow" would make those codes 3 and 4 bytes long respectively, so the compromise is to use a whole range of byte values as multi-byte prefixes.
  • Exactly. If you pass `'latin-1'` as the argument to `encode`, you'll get what you're after.
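Putting the comments above together, here is a short sketch of the two encodings side by side, and of what goes wrong when the codecs are mixed up:

```python
s = 'ü'

print(s.encode())            # b'\xc3\xbc'  (UTF-8, the default)
print(s.encode('latin-1'))   # b'\xfc'      (single byte, matching the table)

# Decoding UTF-8 bytes as latin-1 silently produces mojibake:
print(b'\xc3\xbc'.decode('latin-1'))  # 'Ã¼'

# The UTF-8 decoder knows the two bytes form one character because
# 0xC3 begins with bits 110 (start of a 2-byte sequence) and
# 0xBC begins with bits 10 (a continuation byte).
assert b'\xc3\xbc'.decode('utf-8') == 'ü'
```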
