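I was using Python to read a file saved in UTF-8 encoding. The first line of the file is empty, so I used line.strip() == '' to test for a blank line, but the test gave the wrong result.
After line.strip(), the displayed value looked like '', so why was it not equal to ''? len(line.strip()) turned out to be 3! Very strange, since the value was clearly not empty. I then passed the result to repr() to see its escaped form and found that it contained \xef\xbb\xbf. So what does that value mean?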
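EF BB BF is a file marker known as the Byte order mark (BOM). It is the UTF-8 encoding of the character U+FEFF, and some editors write it at the start of a file to indicate that the file is UTF-8 encoded.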
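To make the symptom concrete, here is a minimal sketch in Python 3 syntax. The post itself used Python 2.7, where a plain str read from a file behaves like the bytes object below. The sample bytes and the variable names are made up for illustration.

```python
import codecs

# Hypothetical first line of a file saved as "UTF-8 with BOM":
# the three BOM bytes followed only by a newline.
first_line = b'\xef\xbb\xbf\n'

stripped = first_line.strip()        # strip() removes whitespace, not the BOM
print(repr(stripped))                # the three BOM bytes are still there
print(len(stripped))                 # 3, so the line is not equal to b''
print(stripped == codecs.BOM_UTF8)   # the leftover bytes are exactly the UTF-8 BOM

# Decoding with utf-8-sig drops the BOM, so the blank-line check works.
decoded = first_line.decode('utf-8-sig')
print(decoded.strip() == '')
```

The value looks empty in most terminals because U+FEFF is a zero-width character, which is why repr() is needed to see it.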
For how to handle it, see the first answer to "Reading Unicode file data with BOM chars in Python", quoted below:
There is no reason to check if a BOM exists or not, utf-8-sig manages that for you and behaves exactly as utf-8 if the BOM does not exist:
# Standard UTF-8 without BOM
>>> b'hello'.decode('utf-8')
'hello'
>>> b'hello'.decode('utf-8-sig')
'hello'
# BOM encoded UTF-8
>>> b'\xef\xbb\xbfhello'.decode('utf-8')
'\ufeffhello'
>>> b'\xef\xbb\xbfhello'.decode('utf-8-sig')
'hello'
In the example above, you can see utf-8-sig correctly decodes the given string regardless of the existence of BOM. If you think there is even a small chance that a BOM character might exist in the files you are reading, just use utf-8-sig and not worry about it.
So when reading the file I now use utf-8-sig. In Python 2.7 the code looks like this:
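import codecs
with codecs.open(file_path, 'r', 'utf-8-sig') as fh:
    for line in fh:
        is_blank = line.strip() == u''  # BOM already removed, so a blank first line matches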
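For readers on Python 3, a rough equivalent is sketched below using the built-in open(), which accepts an encoding argument directly. The helper name blank_line_flags and the temporary sample file are assumptions for illustration only, not part of the original code.

```python
import codecs
import os
import tempfile


def blank_line_flags(path):
    """Return one boolean per line: True where the line is blank."""
    # utf-8-sig skips a leading BOM if present and otherwise acts like utf-8.
    with open(path, 'r', encoding='utf-8-sig') as fh:
        return [line.strip() == '' for line in fh]


# Build a throwaway sample file: BOM, an empty first line, then some text.
fd, sample_path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as out:
    out.write(codecs.BOM_UTF8 + b'\nhello\n')

flags = blank_line_flags(sample_path)
os.remove(sample_path)
print(flags)  # [True, False]
```

With plain 'utf-8' instead of 'utf-8-sig', the first line would decode to '\ufeff\n' and the blank-line check would fail again, because strip() does not treat U+FEFF as whitespace.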