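Of three thousand waters, I take but one ladle to drink (弱水三千,只取一瓢飲)
Of three thousand splendors, I drink all joy and sorrow for one person alone (繁華三千,只為一人飲盡悲歡)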
-
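Extracting Chinese from garbled text: if you have a jumble of strings and only want the Chinese characters in it, Python's re module can do the job.
Here is the code: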
# -*- coding:utf-8 -*-
str ="""<div class="ie-fix"><p class="reader-word-layer reader-liusil
word-s1-0" style="width:42px;heightasd:181px;line-heigx;left:3285px;z-index:2;font-family:simsun;">?</p><p class="reader-word-layer reader-word-s1-0" style="width:42px;height:181px;line-height:181px;top:1376px;left:3370px;z-index:3;font-family:simsun;">?</p><p class="reader-word-layer reader-word-s1-0" style="width:42px;height:181px;line-height:181px;top:1376px;left:3454px;z-inde弱水x:4;font-family:simsun;三">?
</p><p 千class="reader-word-layer reader-word-s1-3" style="width:72px;height:312只取一px;line-t:3551px;z-index:5;fo瓢nt-family:'Times New Roman Bold','7e4b9f2a59010飲20207409c940020001','Times New Roman Bold';font-family:simsun;">?</p><p class="reader-word繁華三-layer reader-word-s1-3 reader-word-s1-4" style="width:2621px;height:312px;line-height:312千px;top:1272p;false"></p><p class="reader只為一人飲-word-layer reader-word-s1-0" style="width:42px;height:181px;line-height:181px;top:13盡悲歡76px;left:6306px;z-index:7;font-family:simsun;">?
"""
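import re

# Match runs of characters in the basic CJK Unified Ideographs range.
pattern = "[\u4e00-\u9fa5]+"
regex = re.compile(pattern)
# findall returns every non-overlapping run of Chinese characters, in order.
# (`str` is the sample string defined above; the name shadows the built-in type.)
result = regex.findall(str)
# Join the runs into a single string.
china_str = "".join(result)
print(china_str)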
Running the code prints:
弱水三千只取一瓢飲繁華三千只為一人飲盡悲歡
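The same idea can be wrapped in a small reusable helper. This is only a sketch: the name `extract_chinese` and the shortened sample string are illustrative assumptions, and the character range is the same `\u4e00-\u9fa5` range used above.

```python
import re

# Same range as above: the basic block of CJK Unified Ideographs.
CHINESE_RUN = re.compile("[\u4e00-\u9fa5]+")


def extract_chinese(text):
    """Return all Chinese characters in text, joined in order of appearance."""
    return "".join(CHINESE_RUN.findall(text))


# A shortened, made-up sample in the spirit of the HTML fragment above.
sample = '<p class="a弱水b">?</p><p 三千class="c">只取一瓢飲</p>'
print(extract_chinese(sample))
```

Compiling the pattern once at module level avoids recompiling it on every call, which helps when many strings are processed.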
Common Unicode character ranges for English, Chinese, Japanese, and Korean text, written as regular expressions:
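- epre = re.compile(r"[\s\w]+", re.ASCII)  # English: ASCII letters, digits, underscore, whitespace (without re.ASCII, \w also matches CJK in Python 3)
- chre = re.compile("[\u4E00-\u9FA5]+")  # Chinese: basic CJK Unified Ideographs
- jpre = re.compile("[\u3040-\u30FF\u31F0-\u31FF]+")  # Japanese: hiragana, katakana, katakana phonetic extensions
- hgre = re.compile("[\u1100-\u11FF\u3130-\u318F\uAC00-\uD7AF]+")  # Korean: Hangul Jamo, compatibility Jamo, Hangul syllables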
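As a rough illustration of how these ranges might be used together, the sketch below counts the characters of each script in a string. The names `SCRIPT_PATTERNS` and `count_by_script` are hypothetical. Note that kanji used in Japanese text fall inside the Chinese range, so they would be counted as Chinese here.

```python
import re

# One single-character pattern per script, using the ranges listed above.
SCRIPT_PATTERNS = {
    "chinese": re.compile("[\u4E00-\u9FA5]"),
    "japanese": re.compile("[\u3040-\u30FF\u31F0-\u31FF]"),
    "korean": re.compile("[\u1100-\u11FF\u3130-\u318F\uAC00-\uD7AF]"),
}


def count_by_script(text):
    """Count how many characters of text fall into each script's range."""
    return {
        name: len(pattern.findall(text))
        for name, pattern in SCRIPT_PATTERNS.items()
    }


counts = count_by_script("弱水三千 ひらがな カタカナ 한국어 abc")
print(counts)  # {'chinese': 4, 'japanese': 8, 'korean': 3}
```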