Background
Because of a business requirement, I need to run word segmentation over several million short texts. jieba is the only tool at hand, so I want to parallelize the segmentation to speed things up.
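For reference, jieba itself ships a built-in parallel mode (jieba.enable_parallel). It splits the text passed to a single jieba.cut call on newlines and cuts the pieces in an internal process pool, but it only works on POSIX systems and treats the whole file as one blob. A minimal sketch of that mode, assuming tmp.txt holds one message per line (the worker count of 4 is just an example):

import jieba

jieba.enable_parallel(4)  # POSIX only: jieba spins up 4 worker processes internally

with open("tmp.txt", "r", encoding="utf-8") as f:
    data = f.read()

# A single cut call; parallel mode splits the input on '\n' behind the scenes
words = list(jieba.cut(data))
print(len(words), "tokens")

jieba.disable_parallel()

Since that mode hands back one flat token stream rather than per-message results, the implementation below drives jieba through a multiprocessing.Pool instead.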
Implementation
import pandas as pd
import jieba
from multiprocessing import Pool
def segment(text):
    # Cut one message with jieba and join the tokens with '-'
    seg_list = jieba.cut(text)
    return '-'.join(seg_list)

def parallel_segment(df):
    # Create a process pool; set the number of processes to suit the machine
    with Pool() as pool:
        df['segmented_text'] = pool.map(segment, df['message'])
    return df

if __name__ == "__main__":
    with open("tmp.txt", "r", encoding="utf-8") as f:
        data = f.read()
    tmp = data.split("\n")
    df = pd.DataFrame(tmp)
    df.columns = ["message"]
    # Keep only messages longer than 10 characters; .copy() avoids the
    # SettingWithCopyWarning when segmented_text is assigned later
    df = df[df.message.str.len() > 10].copy()
    df_segmented = parallel_segment(df)
    df_segmented.to_pickle("result.pickle")
I ran it on my own machine: 15,233 rows took about 4.86 s. Something feels off... that seems a bit too fast.
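A quick way to sanity-check the number is to time a serial pass over the same DataFrame and compare both the durations and the outputs. Note that each pool worker loads the jieba dictionary on its first cut (typically around a second), so part of the parallel time is startup rather than segmentation. A minimal timing sketch, assuming df, segment and parallel_segment from the script above are already defined in the same session:

import time

# Serial baseline: segment every message in the parent process
t0 = time.perf_counter()
serial = df['message'].map(segment)
t1 = time.perf_counter()

# Parallel run through the process pool
df_par = parallel_segment(df)
t2 = time.perf_counter()

print(f"serial:   {t1 - t0:.2f}s")
print(f"parallel: {t2 - t1:.2f}s")

# The two passes should produce identical token strings
assert list(serial) == list(df_par['segmented_text'])

For the full multi-million-row job it may also help to pass a chunksize to pool.map (e.g. pool.map(segment, df['message'], chunksize=1000)) so each worker receives a batch of messages instead of one message per inter-process round trip.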