GB2312 - V2EX

>>> chstring = "中文字符"
>>> with open('test.txt','w') as f:
...     f.write(chstring)
...
>>> mystring = chstring.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 0: ordinal not in range(128)

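Questions:
1. What encoding does the result of string.decode('gbk') use? (The operating system default?)
2. With ASCII as the environment's default encoding, Chinese characters typed in can be written without trouble through open(filename, 'w'), yet '中文字符'.encode('ascii') raises an error. What encoding does the resulting file use, and how is that determined?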

20 replies • 2015-01-11 22:52:24 +08:00

roricon

2015-01-11 17:17:41 +08:00

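The answer to this question differs between py2 and py3.
In practice you only fall into this trap in py2, so I will give a py2 answer.
1. The type after decoding is <unicode>; when it is written to a file, it is saved as <utf-8> by default.
2. See point 1: '中文字符'.encode is bound to fail.

See the session below: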
==========
In [8]: a = '你好'

In [9]: a
Out[9]: '\xe4\xbd\xa0\xe5\xa5\xbd'

In [10]: type(a)
Out[10]: str

In [11]: a.decode('utf-8')
Out[11]: u'\u4f60\u597d'

In [13]: print(a.decode('utf-8'))
你好
==========
In effect, writing a Chinese string literal of type <str> is equivalent to the following steps:
a = u'你好'

a = a.encode('utf-8')

I am not sure whether that explains it clearly.
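A minimal sketch of the point above, written in Python 3 terms, where bytes and str play the roles of py2's str and unicode. It assumes a UTF-8 terminal, so the bytes behind 你好 are the ones shown in Out[9]; the temporary file and the variable names are illustrative only. Writing a byte string stores those bytes unchanged, so the file ends up in whatever encoding produced the bytes.

```python
import os
import tempfile

text = u'\u4f60\u597d'        # the characters 你好 as text (py2: unicode)
raw = text.encode('utf-8')    # what a UTF-8 terminal hands to py2 as a str

# Write the bytes untouched, then read the raw bytes back.
fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, 'wb') as f:
        f.write(raw)
    with open(path, 'rb') as f:
        on_disk = f.read()
finally:
    os.remove(path)

print(repr(on_disk))                      # the same bytes as in Out[9]
print(on_disk.decode('utf-8') == text)    # decoding recovers the text
```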

9hills

2015-01-11 17:36:49 +08:00

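To repeat what I have already said in many other places:

There is no need whatsoever to use sys.setdefaultencoding('utf8') in Python2.

As for the other two questions, all you need to know is that encode goes Unicode -> xxx and decode goes xxx -> Unicode.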
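The two directions can be sketched as follows, in Python 3 terms (str is the Unicode side, bytes is the "xxx" side). The variable names are illustrative; the string is the 中文字符 from the original question, written with escapes so the snippet does not depend on the source file's encoding.

```python
text = u'\u4e2d\u6587\u5b57\u7b26'    # 中文字符 as Unicode text

gbk_bytes = text.encode('gbk')        # Unicode -> GBK bytes
utf8_bytes = text.encode('utf-8')     # Unicode -> UTF-8 bytes

# The same characters become different byte sequences under each codec.
print(len(gbk_bytes), len(utf8_bytes))

# Decoding with the matching codec gives back the same Unicode text.
assert gbk_bytes.decode('gbk') == text
assert utf8_bytes.decode('utf-8') == text

# ASCII has no code for these characters, so encoding to it fails.
try:
    text.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError as exc:
    ascii_failed = True
    print('cannot encode:', exc.reason)
```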

zhicheng

2015-01-11 18:59:35 +08:00

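1. It depends on the specific codec. The gbk codec in the example should return a unicode object when used on a str, and should still return unicode when used on a unicode object.
2. '中文字符'.encode('ascii') fails because the ascii codec cannot handle the str '中文字符'.

Every encode and decode depends on the specific codec, and the result can be either str or unicode.
For example, the result of 'abcd'.encode('hex') is still a str.

Read the documentation more.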
https://docs.python.org/2/library/codecs.html
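For reference, a small sketch of the bytes-to-bytes codecs mentioned above. Python 3 removed the 'abcd'.encode('hex') spelling, so this uses codecs.encode and codecs.decode, which accept the same codec names; the variable names are illustrative.

```python
import codecs

data = b'abcd'
hexed = codecs.encode(data, 'hex')       # py2 spelling: 'abcd'.encode('hex')
b64 = codecs.encode(data, 'base64')      # py2 spelling: 'abcd'.encode('base64')

# Both results are byte strings, not Unicode text.
print(type(hexed), hexed)
print(type(b64), b64)

# The matching decode call reverses each transformation.
assert codecs.decode(hexed, 'hex') == data
assert codecs.decode(b64, 'base64') == data
```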

loading

2015-01-11 19:05:01 +08:00 via Android

@9hills Then why does my code run when I use it and fail when I don't?

Sylv

2015-01-11 19:41:45 +08:00 via iPhone

@zhicheng The result can be either str or unicode? I don't think that is right.
It should be:
str.decode() -> unicode
unicode.encode() -> str
str.encode() -> str.decode().encode() -> str
unicode.decode() -> unicode.encode().decode() -> unicode
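The third line of the chain above would explain why the original traceback reports a UnicodeDecodeError even though encode was called. A rough sketch in Python 3 terms (bytes for py2's str); py2_str_encode is a hypothetical helper that spells out the implicit step, assuming the default encoding is ascii.

```python
def py2_str_encode(raw, encoding, default='ascii'):
    """Hypothetical helper: mimic py2's str.encode(encoding) for text codecs."""
    # py2 first decodes the bytes with the default encoding, then encodes.
    return raw.decode(default).encode(encoding)

# Pure ASCII bytes survive the implicit decode step.
assert py2_str_encode(b'abc', 'utf-8') == b'abc'

raw = u'\u4e2d\u6587'.encode('utf-8')    # non-ASCII bytes, as typed at a UTF-8 terminal

message = None
try:
    py2_str_encode(raw, 'ascii')
except UnicodeDecodeError as exc:
    # The failure happens in the hidden decode, before any encoding is attempted.
    message = str(exc)
    print(message)
```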

Sylv

2015-01-11 19:44:38 +08:00 via iPhone

@loading I have never used sys.setdefaultencoding either; there are more correct ways to solve this.

loading

2015-01-11 19:48:16 +08:00 via Android

@Sylv Please enlighten me.

Sylv

2015-01-11 20:02:53 +08:00

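@loading Different errors need to be handled case by case.
The basic idea is: decode('utf-8') any str input that might be non-ascii into unicode, use unicode for all internal processing, and encode('utf-8') back to str only when producing output or when calling something that does not support unicode (for example urllib.unquote).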
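A minimal sketch of that decode, process, encode pattern in Python 3 terms. first_chars is a hypothetical function, and UTF-8 input is an assumption; the point is only that character-level work happens on text, never on raw bytes.

```python
def first_chars(raw_in, n, encoding='utf-8'):
    """Hypothetical example: return the first n characters of encoded input."""
    text = raw_in.decode(encoding)    # decode once, at the input boundary
    result = text[:n]                 # work on characters, not bytes
    return result.encode(encoding)    # encode once, at the output boundary

raw_in = u'\u4e2d\u6587abc'.encode('utf-8')    # 中文abc as UTF-8 bytes

# Slicing the bytes directly would cut the first character in half.
print(len(raw_in), len(raw_in.decode('utf-8')))

out = first_chars(raw_in, 2)
print(out.decode('utf-8') == u'\u4e2d\u6587')
```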

hahastudio

2015-01-11 20:03:56 +08:00

http://nedbatchelder.com/text/unipain.html

zhicheng

2015-01-11 20:34:18 +08:00

@Sylv Source?

Sylv

2015-01-11 21:28:00 +08:00 via iPhone

@zhicheng
Codec.encode(input[, errors])
Encodes the object input and returns a tuple (output object, length consumed). While codecs are not restricted to use with Unicode, in a Unicode context, encoding converts a Unicode object to a plain string using a particular character set encoding (e.g., cp1252 or iso-8859-1).

Codec.decode(input[, errors])
Decodes the object input and returns a tuple (output object, length consumed). In a Unicode context, decoding converts a plain string encoded using a particular character set encoding to a Unicode object.

https://docs.python.org/2/library/codecs.html

zhicheng

2015-01-11 21:35:59 +08:00

@Sylv So, have you actually read the text above?

013231

2015-01-11 22:14:09 +08:00

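Answers:

1. Decoding a str gives you a unicode object. The unicode type is an abstraction over characters; you do not need to care which encoding it uses internally (it may be UCS-2 or UCS-4, chosen at compile time).

2. For text typed into the interpreter, the encoding is determined by sys.stdin.encoding. ASCII cannot encode Chinese characters, so you get an error.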
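A quick way to inspect the settings mentioned above. The values are environment dependent, so none are shown here; note that py2 reports 'ascii' as the default encoding unless it has been changed, while Python 3 reports 'utf-8'. The variable names are illustrative.

```python
import sys

# Encoding the interpreter assumes for text typed at the prompt.
# stdin may be missing or lack the attribute when input is redirected.
stdin_encoding = getattr(sys.stdin, 'encoding', None)
print('stdin encoding:', stdin_encoding)

# Encoding used for implicit str <-> unicode conversions in py2.
default_encoding = sys.getdefaultencoding()
print('default encoding:', default_encoding)

# Largest code point the build supports: 0xffff on a narrow (UCS-2) build,
# 0x10ffff on a wide (UCS-4) build and on Python 3.3 or later.
print('max code point:', hex(sys.maxunicode))
```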

Sylv

2015-01-11 22:15:47 +08:00

@zhicheng
Is there a problem with that?
Then please give an example of str.encode(codecs) -> unicode or unicode.decode(codecs) -> str.

013231

2015-01-11 22:17:32 +08:00

PS: A str is a sequence of bytes, so calling encode on it is meaningless.
Decoding a str gives unicode; encoding a unicode gives str.

Sylv

2015-01-11 22:31:43 +08:00

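@zhicheng I see what you mean now: you are talking about codecs such as hex that are not related to Unicode. I was talking about the Unicode handling in the original poster's case. Lesson learned.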

zhicheng

2015-01-11 22:33:18 +08:00

@013231 You are not aware of 'abcd'.encode('hex') and 'abcd'.encode('base64')?

@Sylv
>>> help(''.encode)
>>> help(''.decode)
>>> help(u''.encode)
>>> help(u''.decode)

Take a look at the output yourself?

013231

2015-01-11 22:41:35 +08:00

@zhicheng You are right, that was an oversight on my part. I only considered conversion between str and unicode.

Sylv

2015-01-11 22:48:19 +08:00

@zhicheng
'abcd'.encode('hex') and 'abcd'.encode('base64') both return str, which is consistent with str.encode(Unicode codecs) -> str, so there is nothing special about them.
But I really have never seen a case of str.encode -> unicode.

zhicheng

2015-01-11 22:52:24 +08:00

@Sylv Nothing makes it impossible; it is just not something you need.
>>> import codecs
>>> from t import search_function
>>> codecs.register(search_function)
>>> type(''.encode('foobar'))
<type 'unicode'>
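The module t in the session above is not shown in the thread. One possible reconstruction, offered only as a guess: a search function that registers a codec named foobar whose "encoder" returns text. The helper names _to_text and _to_bytes are hypothetical. Python 3's bytes has no encode method, so the sketch calls codecs.encode instead, which places no restriction on the result type.

```python
import codecs

def _to_text(data, errors='strict'):
    # "Encoder" that returns text rather than bytes, plus the length consumed.
    return data.decode('latin-1'), len(data)

def _to_bytes(text, errors='strict'):
    # Matching "decoder" that goes back from text to bytes.
    return text.encode('latin-1'), len(text)

def search_function(name):
    # The codec registry calls this with the normalized codec name.
    if name == 'foobar':
        return codecs.CodecInfo(encode=_to_text, decode=_to_bytes, name='foobar')
    return None

codecs.register(search_function)

result = codecs.encode(b'', 'foobar')    # py2 spelling: ''.encode('foobar')
print(type(result))                       # the text type, not the bytes type
```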