As a crawler engineer, it helps to know a little about word frequency counting. When processing text for public-opinion monitoring, you often need to count how many times each word appears, or find the top 10 words in a text. This is a small step toward the simple data analysis you may do later. Let's get started.
import re
text = '''Which year will be the turning point for the world's most populous country in which its population experiences negative growth? Chinese demographers differ in their answers.
Experts with the Chinese Academy of Social Sciences estimate the turning point could arrive around 2028 after the population peaks to 1.44 billion, says the Green Book of Population and Labor co-released by the Chinese Academy of Social Sciences and Social Sciences Academic Press on Thursday.
However, Huang Wenzheng, a demographics expert, told the Global Times on Friday that this estimate is too optimistic. He estimated the year 2024 or 2025 will be the threshold for population negative growth.
According to Huang, the prediction in the green book is based on the fertility rate that could remain at 1.6, which is hard to realize.
In 2016, China's fertility rate was 1.7, but in 2017, the number of births was less, according to media reports.
The births in 2016 and 2017 were high compared to years before, said Huang. "This was due to the introduction of two-child policy for all families [in 2016] which encouraged those who had the willingness to have a second child before the policy. So they hastened to give birth in these two years."
"But the overall trend is that people are no longer willing to have more children."
Huang elaborated that people's concept of raising children has changed. Urban people care about quality, rather than quantity. "They want to provide the best resources they have to bring up their children. This won't be possible if they have several," he said.
With rapid urbanization, many people from rural areas come to work in the city and also follow this practice.
"Previously people thought that having two or three children is normal. But now they are accustomed to having only one child. They find this normal," Huang said.
Yi Fuxian, a research fellow at the University of Wisconsin-Madison, holds a more pessimistic view. He told the Global Times that 2018 has seen negative growth based on his own research and analysis.
Both Yi and Huang believe that China will abandon the two-child policy this year, putting an end to family planning, in order to stimulate births. They also warned that the sharp decline in population could have negative influence on the economy.
China has introduced a series of new measures to stimulate fertility. This year, the country's tax cuts also favor families with children. Families are able to deduct 12,000 yuan ($1,748) a year from their taxable income for children's education.
Huang said this is still far from enough. He suggested the government provide free upbringing of children aged 0 to 3 and make kindergarten education compulsory to further ease the burden of educating children.
'''
# Word frequency count
def word_count(string):
    if not isinstance(string, str):
        # Raising a plain string is invalid in Python 3; raise an exception instead
        raise TypeError('Please enter a string')
    # Trim both ends, then split on runs of whitespace
    str_list = re.split(r'\s+', string.strip())
    word_dict = {}
    for str_word in str_list:
        if str_word in word_dict:  # key already exists: add 1 to its value
            word_dict[str_word] = word_dict[str_word] + 1
        else:
            word_dict[str_word] = 1
    return word_dict
word = word_count(string=text)
#print(word)
# Sort the counts in descending order and take the top 10
word_list = sorted(word.items(), key=lambda x: x[1], reverse=True)[0:10]
print(word_list)
The screenshot above shows the top 10 words in the text and how many times each one appears.
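As an alternative sketch, the standard library's collections.Counter can handle both the counting and the top-N selection. The version below also lowercases the text and strips punctuation, which the function above does not do, so its counts will differ from the ones in the screenshot. The names count_words and sample are made up for illustration.

```python
import re
from collections import Counter

def count_words(string):
    """Count case-insensitive words, ignoring surrounding punctuation."""
    if not isinstance(string, str):
        raise TypeError('Please enter a string')
    # A word is a run of letters/digits, optionally joined by ' or -
    # (for example "two-child" or "won't")
    tokens = re.findall(r"[a-z0-9]+(?:['-][a-z0-9]+)*", string.lower())
    return Counter(tokens)

# Hypothetical sample sentence, not the article text used above
sample = "The births in 2016 and 2017 were high. The trend, the experts said, is down."
top3 = count_words(sample).most_common(3)
print(top3)
```

Because "The" and "the" are merged and the trailing commas and periods are dropped, the most frequent word here is "the" with a count of 3. Calling count_words(text).most_common(10) would give a top 10 for the article text in the same way.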
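That covers word frequency for English text. Next, let's look at how to count words in Chinese text.
For Chinese, we first need to install a third-party word segmentation library, jieba.
Install: pip install jieba
Segmenting the text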
import jieba
content_text = '''然而，我們并沒有時間去探索數據集中的數千個案例。我們應該做的則是在測試案例的典型范例上繼續運行LIME，看看哪些詞的占有率仍能位居前列。通過這種方法，我們可以獲得像以前模型那樣的單詞的重要性分數，并驗證模型的預測'''
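def get_(string):
    # Full mode (cut_all=True) returns every word jieba can find, so pieces may overlap
    words = list(jieba.cut(string, cut_all=True))
    counts = {}  # renamed from dict/str to avoid shadowing built-ins
    for token in words:
        if token != '' and token != '\n':  # skip empty strings and newlines
            if token in counts:
                counts[token] = counts[token] + 1
            else:
                counts[token] = 1
    return counts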
word = get_(string=content_text)
# Take the top 10 words
word_list = sorted(word.items(), key=lambda x: x[1], reverse=True)[0:10]
print(word_list)
This is a screenshot of the word frequency results for the Chinese text.
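In Chinese text, punctuation marks and single-character function words tend to crowd the top of the list. Here is a minimal sketch of one way to filter them out before ranking, assuming that dropping every token shorter than two characters is acceptable for your data. The helper top_words and its min_len parameter are hypothetical names, not part of jieba.

```python
from collections import Counter

def top_words(tokens, n=10, min_len=2):
    """Return the n most frequent tokens, skipping whitespace and very short tokens."""
    cleaned = (t.strip() for t in tokens)
    # min_len=2 drops single characters, which also removes most punctuation tokens
    counts = Counter(t for t in cleaned if len(t) >= min_len)
    return counts.most_common(n)

# With jieba installed, the tokens could come from jieba.cut(content_text);
# a hand-written token list is used here so the sketch runs on its own.
tokens = ['我們', '可以', '，', '模型', '的', '預測', '模型', '\n', '我們', '模型']
result = top_words(tokens, n=2)
print(result)
```

In this hand-made list, the comma, the newline, and the single character 的 are discarded, leaving 模型 (3 times) and 我們 (twice) as the top two.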
That's it for today's summary. If you're interested, feel free to send me a private message.