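Fixes for problems that can appear when a web page's encoding is non-standard and the correct encoding cannot be determined:
1. The boilerplate people add out of habit (Python 2 only)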
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
2. Handling the conflict when the page encoding is not UTF-8
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 215: ordinal not in range(128)
# Third-party modules
import chardet
import requests
url = 'http://my.oschina.net/u/1188877/blog'
req = requests.get(url, timeout=5, verify=False, allow_redirects=True)
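# detect() takes the raw bytes and returns a dict such as {'confidence': 1.0, 'encoding': 'ascii'}:
# 'confidence' is the detection confidence, 'encoding' is the detected encoding
codes = chardet.detect(req.content)['encoding']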
page_content = req.content.decode(codes, 'ignore').encode('utf-8')
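The decode step above breaks if the detector cannot name an encoding, or names one Python has no codec for. Below is a minimal sketch of a more defensive version. The helper name `to_utf8` and its `fallback` parameter are assumptions for illustration, and `detected_encoding` stands in for `chardet.detect(raw)['encoding']` so the example runs without the third-party module.

```python
import codecs


def to_utf8(raw, detected_encoding, fallback='utf-8'):
    """Re-encode raw bytes as UTF-8, tolerating a missing or unknown encoding.

    detected_encoding stands in for chardet.detect(raw)['encoding'],
    which can be None when detection fails.
    """
    encoding = detected_encoding or fallback
    try:
        codecs.lookup(encoding)
    except LookupError:
        # The detector named a codec Python does not know; use the fallback.
        encoding = fallback
    # 'ignore' drops undecodable bytes, as in the snippet above.
    return raw.decode(encoding, 'ignore').encode('utf-8')


# Example: GBK-encoded bytes for two Chinese characters.
gbk_bytes = u'\u4e2d\u6587'.encode('gbk')
utf8_bytes = to_utf8(gbk_bytes, 'GBK')
```

With requests, the call would look like `to_utf8(req.content, chardet.detect(req.content)['encoding'])`.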
3. Some sites do not actually GZIP-compress the response body, which makes requests raise an exception when it tries to decompress it automatically
ContentDecodingError: ('Received response with content-encoding: gzip, but failed to decode it.', error('Error -3 while decompressing: incorrect header check',))
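The "incorrect header check" in that message comes from zlib being handed bytes that are not really gzip. The sketch below reproduces that situation with the standard library only and falls back to the raw body when decompression fails. The helper name `gunzip_or_raw` is an assumption for illustration, and this is not how requests itself handles the error.

```python
import zlib


def gunzip_or_raw(body):
    """Return the gzip-decompressed body, or the body unchanged if it is not gzip."""
    try:
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        return zlib.decompress(body, 16 + zlib.MAX_WBITS)
    except zlib.error:
        # Same failure class as "Error -3 ... incorrect header check".
        return body


# Build a real gzip payload to compare against a mislabelled plain one.
compressor = zlib.compressobj(9, zlib.DEFLATED, 16 + zlib.MAX_WBITS)
gzipped = compressor.compress(b'<html>ok</html>') + compressor.flush()

from_gzip = gunzip_or_raw(gzipped)
from_plain = gunzip_or_raw(b'<html>ok</html>')
```

Note that a truncated gzip stream would also fall through to the raw bytes here, so a stricter version may want to log or re-raise in that case.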
4. Used on its own, each of the fixes above can still hit the occasional special-case error, so the best approach is to combine all of them
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
# Third-party modules
import chardet
import requests
url = 'http://my.oschina.net/u/1188877/blog'
req = requests.get(url, headers={'Accept-Encoding': ''}, timeout=5, verify=False, allow_redirects=True)
codes = chardet.detect(req.content)['encoding']
page_content = req.content.decode(codes, 'ignore').encode('utf-8')
5. Encoding error when writing Django POST data that contains Chinese
# Updated 2016-01-18
# Data posted from the page via ajax is treated as ASCII, so handling Chinese characters raises an error
# 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)
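The post does not show the fix that was used. One common way to avoid this class of error, assumed here rather than taken from the original, is to decode byte strings to text explicitly before combining them with other text, so that Python 2 never falls back to the ascii codec. `ensure_text` is a hypothetical helper name, and UTF-8 is an assumed encoding for the posted data.

```python
def ensure_text(value, encoding='utf-8'):
    """Return value as text, decoding byte strings explicitly."""
    if isinstance(value, bytes):  # on Python 2, bytes is str
        return value.decode(encoding)
    return value


# A UTF-8 byte string starting with 0xe5, like the one in the error above.
posted = b'\xe5\x8f\xb7'
text = ensure_text(posted)
# Safe: both operands are text, so no implicit ascii decode happens.
message = u'value: ' + text
```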
6. A small garbled-Chinese problem with MySQLdb
# The error looks like this
_mysql_exceptions.OperationalError: (2019, "Can't initialize character set utf-8 (path: /usr/share/mysql/charsets/)")
# Python call that triggers it
conn = MySQLdb.connect(host, user, passwd, dbname, port, charset="utf-8")
# A careless mistake: /usr/share/mysql/charsets/Index.xml lists no charset named utf-8, only utf8
conn = MySQLdb.connect(host, user, passwd, dbname, port, charset="utf8")
# Set the database, the Python source files, and the page files all to UTF-8, and garbled text generally goes away
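Because Python codec names and MySQL charset names differ in spelling, a small lookup can catch the mistake before connecting. The sketch below is illustrative only: `MYSQL_CHARSETS` and `mysql_charset` are hypothetical names, and the table covers just a few charsets, so check it against what your own server lists.

```python
# Illustrative mapping from common Python-style names to MySQL charset names.
MYSQL_CHARSETS = {
    'utf-8': 'utf8',
    'utf8': 'utf8',
    'latin-1': 'latin1',
    'latin1': 'latin1',
    'gbk': 'gbk',
}


def mysql_charset(name):
    """Translate an encoding name into the spelling MySQL expects."""
    key = name.strip().lower().replace('_', '-')
    try:
        return MYSQL_CHARSETS[key]
    except KeyError:
        raise ValueError('no known MySQL charset for %r' % name)


charset = mysql_charset('UTF-8')
```

The result could then be passed as `charset=charset` in the `MySQLdb.connect` call.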