We originally made do with the dataset utilities under torch.utils.data for data processing, but the DataLoader there requires samples of aligned length, and every data source needs its own detailed handling.
Torch provides the torchtext.data module for text processing; combined with a Chinese word segmentation tool, it basically covers day-to-day text processing.
This topic is an introduction to torchtext. It mainly covers the use of Field, Example, Dataset and Vectors, and finishes with a text classification example built on an LSTM network. torchtext really is quite a powerful tool module.
Module structure of torchtext
The torchtext package contains text data processing and text datasets:
- Text data processing
  - torchtext.data
  - torchtext.data.utils
  - torchtext.data.functional
  - torchtext.data.metrics
  - torchtext.vocab
  - torchtext.utils
- Text datasets
  - torchtext.datasets
  - torchtext.experimental.datasets
  - examples
Text data processing
- Note:
  - We start using TorchText from torchtext.data.

Structure of torchtext.data
torchtext.data consists of:
- Dataset, Batch, and Example
- Fields
- Iterators
- Pipeline
- Functions
The core pattern of text processing:
- Dataset specifies the text data source;
- Field specifies how a field is processed;
- Iterator iterates over the dataset;

Below is a diagram of the TorchText usage pattern.
- Reference link: http://mlexplained.com/2018/02/08/a-comprehensive-tutorial-to-torchtext/

[Figure: TorchText usage pattern diagram]
A TorchText usage example
- Below we walk through an example that illustrates the TorchText usage pattern:
  - Environment setup
  - Data source
  - Defining the fields (Field)
  - Building the dataset
  - Building batches
  - Word vectors and building the vocabulary
  - Using the dataset
Environment setup
- Install torchtext:
pip install torchtext
- Note:
  - Because of a bug, it is recommended to install the fixed version directly from GitHub:
pip install https://github.com/pytorch/text/archive/master.zip
[Screenshot: installing torchtext]
- Optional install 1, a tokenizer:
pip install spacy
python -m spacy download en
- spacy website: https://spacy.io/models/
[Screenshot: the spaCy tokenizer]
- Note:
  - Downloading the trained model may fail for network reasons; in that case install it like this:
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz
  - You can also download the file first and then install it (that is what was done here).
[Screenshot: installing the model package]
- Optional install 2, a tokenizer:
pip install sacremoses
[Screenshot: installing sacremoses]
- Required install, a tokenizer:
  - jieba (Chinese word segmentation):
pip install jieba
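- A quick way to confirm jieba works is to segment a throwaway sentence (the sentence and the segmentation shown are illustrative; results depend on jieba's dictionary):
import jieba

print(jieba.lcut("我爱自然语言处理"))  # lcut returns the segmentation as a list
# e.g. ['我', '爱', '自然语言', '处理']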
Data source
- Download address: https://github.com/bigboNed3/chinese_text_cnn
- Downloaded files:
  - training set: train.tsv
  - test set: test.tsv
  - validation set: dev.tsv
[Screenshot: the data source files]
- Note:
  - The files could also be stored in other formats, for example text files or JSON files.
- Data format:
  - index (a redundant column)
  - label
  - text
[Screenshot: the data format]
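- To see the format for yourself, peek at the file with pandas (a sketch; it assumes the datasets/train.tsv path used later in this article):
import pandas as pd

preview = pd.read_csv("datasets/train.tsv", sep='\t')
print(preview.columns.tolist())  # expected to include 'label' and 'text'
print(preview.head(3))           # the first three rows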
Defining the fields: Field
Help documentation for the Field class
- There are two ways to set the parameters of a Field object:
  - set them in the constructor
  - set them as attributes (we use attribute assignment below)
from torchtext.data import Field
Field?
Init signature:
Field(
    sequential=True,
    use_vocab=True,
    init_token=None,
    eos_token=None,
    fix_length=None,
    dtype=torch.int64,
    preprocessing=None,
    postprocessing=None,
    lower=False,
    tokenize=None,
    tokenizer_language='en',
    include_lengths=False,
    batch_first=False,
    pad_token='<pad>',
    unk_token='<unk>',
    pad_first=False,
    truncate_first=False,
    stop_words=None,
    is_target=False,
)
Docstring:
Defines a datatype together with instructions for converting to Tensor.
Field class models common text processing datatypes that can be represented
by tensors. It holds a Vocab object that defines the set of possible values
for elements of the field and their corresponding numerical representations.
The Field object also holds other parameters relating to how a datatype
should be numericalized, such as a tokenization method and the kind of
Tensor that should be produced.
If a Field is shared between two columns in a dataset (e.g., question and
answer in a QA dataset), then they will have a shared vocabulary.
Attributes:
sequential: Whether the datatype represents sequential data. If False,
no tokenization is applied. Default: True.
use_vocab: Whether to use a Vocab object. If False, the data in this
field should already be numerical. Default: True.
init_token: A token that will be prepended to every example using this
field, or None for no initial token. Default: None.
eos_token: A token that will be appended to every example using this
field, or None for no end-of-sentence token. Default: None.
fix_length: A fixed length that all examples using this field will be
padded to, or None for flexible sequence lengths. Default: None.
dtype: The torch.dtype class that represents a batch of examples
of this kind of data. Default: torch.long.
preprocessing: The Pipeline that will be applied to examples
using this field after tokenizing but before numericalizing. Many
Datasets replace this attribute with a custom preprocessor.
Default: None.
postprocessing: A Pipeline that will be applied to examples using
this field after numericalizing but before the numbers are turned
into a Tensor. The pipeline function takes the batch as a list, and
the field's Vocab.
Default: None.
lower: Whether to lowercase the text in this field. Default: False.
tokenize: The function used to tokenize strings using this field into
sequential examples. If "spacy", the SpaCy tokenizer is
used. If a non-serializable function is passed as an argument,
the field will not be able to be serialized. Default: string.split.
tokenizer_language: The language of the tokenizer to be constructed.
Various languages currently supported only in SpaCy.
include_lengths: Whether to return a tuple of a padded minibatch and
a list containing the lengths of each examples, or just a padded
minibatch. Default: False.
batch_first: Whether to produce tensors with the batch dimension first.
Default: False.
pad_token: The string token used as padding. Default: "<pad>".
unk_token: The string token used to represent OOV words. Default: "<unk>".
pad_first: Do the padding of the sequence at the beginning. Default: False.
truncate_first: Do the truncating of the sequence at the beginning. Default: False
stop_words: Tokens to discard during the preprocessing step. Default: None
is_target: Whether this field is a target variable.
Affects iteration over batches. Default: False
File:           c:\program files\python36\lib\site-packages\torchtext\data\field.py
Type:           type
Subclasses:     ReversibleField, NestedField, LabelField, ShiftReduceField, ParsedTextField, BABI20Field
The Field class explained
CLASS torchtext.data.Field(
    sequential=True,         # Whether the data is sequential. Default True; if False, no tokenization is applied.
    use_vocab=True,          # Whether to use a Vocab object. Default True; if False, the data in this field is assumed to be numerical already.
    init_token=None,         # A token prepended to every example of this field, or None for no initial token.
    eos_token=None,          # A token appended to every example of this field, or None for no end-of-sentence token.
    fix_length=None,         # Pad every example of this field to this fixed length, or None for flexible lengths.
    dtype=torch.int64,       # The data type.
    preprocessing=None,      # Pipeline applied after tokenization but before numericalization.
    postprocessing=None,     # Pipeline applied after numericalization but before the numbers are turned into a tensor.
    lower=False,             # Whether to lowercase the text.
    tokenize=None,           # Function that turns the text into a list of tokens; defaults to string.split. If "spacy" is given, the SpaCy tokenizer is used.
    tokenizer_language='en', # Language of the tokenizer; languages other than en are currently supported only through SpaCy.
    include_lengths=False,   # Whether to return a tuple of (padded minibatch, list of sample lengths) instead of just the padded minibatch.
    batch_first=False,       # Whether the batch dimension comes first (some network modules, e.g. LSTM, care about this layout).
    pad_token='<pad>',       # The token used for padding.
    unk_token='<unk>',       # The token used for OOV words. OOV (out-of-vocabulary) means a word that is not in the vocabulary.
    pad_first=False,         # Pad at the beginning (True) or at the end (False) of the sequence.
    truncate_first=False,    # When the text exceeds the length limit, truncate at the beginning (True) or at the end (False).
    stop_words=None,         # Tokens (stop words) to discard during preprocessing.
    is_target=False)         # Whether this field is a target (label) field.
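- For comparison with the attribute style used below, the same kind of text field could also be created in a single constructor call (a sketch; word_cut is the tokenizer function defined in the next section):
fld_text = Field(sequential=True, use_vocab=True, tokenize=word_cut)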
構(gòu)建Feild字段的例子
- 根據(jù)上面的數(shù)據(jù)源來構(gòu)建三個字段:
- 索引(無字段名)
- 標(biāo)簽(label)
- 特征(text)
- 構(gòu)建默認(rèn)對象
- 因為索引不是我們需要的數(shù)據(jù)列翅睛,所以該字段不用處理。
from torchtext.data import Field
fld_label = Field()
fld_text = Field()
- 設(shè)置基本屬性
# 標(biāo)簽字段比較簡答
fld_label.sequential = False # 這個屬性默認(rèn)True
fld_label.use_vocab = False # 這個屬性默認(rèn)True
# 特征字段
fld_text .sequential = True # 這個屬性默認(rèn)True
fld_text .use_vocab = True # 這個屬性默認(rèn)True
# 因為sequential為True黑竞,則必須指定分詞屬性token
- 設(shè)置token屬性宏所,指定分詞函數(shù)
- 該函數(shù)的要求:
- 參數(shù):傳入一個樣本的特征(就是text字段)
- 返回:返回一個列表,就是分詞以后的結(jié)果摊溶,這樣字段的數(shù)據(jù)就不是字符串爬骤,而是單詞列表。
- 該函數(shù)的要求:
import re
import jieba
regex = re.compile(r'[^\u4e00-\u9fa5aA-Za-z0-9]')
def word_cut(text):
text = regex.sub(' ', text)
return [word for word in jieba.cut(text) if word.strip()]
fld_text .tokenize = word_cut
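- To see what the tokenizer produces, run it on a throwaway sentence (illustrative; the exact segmentation depends on jieba's dictionary). Note how the regex first replaces every character that is not Chinese, a letter, or a digit with a space, so punctuation disappears before jieba runs:
print(word_cut("今天天气很好,适合散步!"))
# e.g. ['今天天气', '很', '好', '适合', '散步']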
構(gòu)建數(shù)據(jù)集
Dataset的幫助文檔
from torchtext.data import Dataset
Dataset?
?[1;31mInit signature:?[0m ?[0mDataset?[0m?[1;33m(?[0m?[0mexamples?[0m?[1;33m,?[0m ?[0mfields?[0m?[1;33m,?[0m ?[0mfilter_pred?[0m?[1;33m=?[0m?[1;32mNone?[0m?[1;33m)?[0m?[1;33m?[0m?[0m
?[1;31mDocstring:?[0m
Defines a dataset composed of Examples along with its Fields.
Attributes:
sort_key (callable): A key to use for sorting dataset examples for batching
together examples with similar lengths to minimize padding.
examples (list(Example)): The examples in this dataset.
fields (dict[str, Field]): Contains the name of each column or field, together
with the corresponding Field object. Two fields with the same Field object
will have a shared vocabulary.
Init docstring:
Create a dataset from a list of Examples and Fields.
Arguments:
examples: List of Examples.
fields (List(tuple(str, Field))): The Fields to use in this tuple. The
string is a field name, and the Field is the associated field.
filter_pred (callable or None): Use only examples for which
filter_pred(example) is True, or use all examples if None.
Default is None.
File:           c:\program files\python36\lib\site-packages\torchtext\data\dataset.py
Type:           type
Subclasses:     TabularDataset, LanguageModelingDataset, SST, TranslationDataset, SequenceTaggingDataset, TREC, IMDB, BABI20
Dataset attributes explained
- The constructor:
Dataset(examples, fields, filter_pred=None)
# examples: the data, a list of Example objects.
# fields: the field list, of type list(tuple(str, Field)).
# filter_pred: a callable used to filter the dataset; a sample is used only if the callable returns True for it. If None, all samples are used.
- Attributes:
  - sort_key: of type callable
  - examples: of type list(Example)
  - fields: of type dict[str, Field]
構(gòu)建數(shù)據(jù)集的字段
Dataset需要的字段是列表類型:fields (List(tuple(str, Field)))
下面例子是完整的字段的構(gòu)建例子
from torchtext.data import Field
import re
regex = re.compile(r'[^\u4e00-\u9fa5aA-Za-z0-9]')
# ----------------------------------------------------------------------
# 1. 數(shù)據(jù)集需要的Fields定義:fields (List(tuple(str, Field)))
fld_label = Field()
fld_text = Field()
# 標(biāo)簽字段比較簡答
fld_label.sequential = False # 這個屬性默認(rèn)True
fld_label.use_vocab = False # 這個屬性默認(rèn)True
# 特征字段
fld_text .sequential = True # 這個屬性默認(rèn)True
fld_text .use_vocab = True # 這個屬性默認(rèn)True
# 因為sequential為True,則必須指定分詞屬性token
def word_cut(text):
text = regex.sub(' ', text)
return [word for word in jieba.cut(text) if word.strip()]
fld_text .tokenize = word_cut
# 構(gòu)建Dataset需要的fields
fields = [("text", fld_text),("label", fld_label)] # 兩個字段
fields
[('text', <torchtext.data.field.Field at 0x2a9d2336a20>),
('label', <torchtext.data.field.Field at 0x2a9d2336780>)]
Help documentation for Example
from torchtext.data import Example
help(Example)
Help on class Example in module torchtext.data.example:
class Example(builtins.object)
| Defines a single training or test example.
|
| Stores each column of the example as an attribute.
|
| Class methods defined here:
|
| fromCSV(data, fields, field_to_index=None) from builtins.type
|
| fromJSON(data, fields) from builtins.type
|
| fromdict(data, fields) from builtins.type
|
| fromlist(data, fields) from builtins.type
|
| fromtree(data, fields, subtrees=False) from builtins.type
|
| ----------------------------------------------------------------------
| Data descriptors defined here:
|
| __dict__
| dictionary for instance variables (if defined)
|
| __weakref__
| list of weak references to the object (if defined)
構(gòu)建Example對象
- Example提供一組類函數(shù)實現(xiàn)Example對象構(gòu)建巩掺,所謂的工廠模式就是這個了偏序。
- 參數(shù)需要數(shù)據(jù)與字段描述。
- 數(shù)據(jù)與字段的長度應(yīng)該是對應(yīng)的胖替。
from torchtext.data import Field
from torchtext.data import Example
# 這里使用上面構(gòu)建的fields研儒,上面的fields是否正確豫缨,這個就可以檢測
one_example = Example.fromlist(["我是數(shù)據(jù),很長的數(shù)據(jù)", 1], fields) # 1是標(biāo)簽
one_example
<torchtext.data.example.Example at 0x2a9d22e2ba8>
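- Each column of the example is stored as an attribute, so the contents can be inspected directly (the tokens shown are indicative; the exact segmentation depends on jieba's dictionary):
print(one_example.text)  # the tokenized word list, e.g. ['我', '是', '数据', '很长', '的', '数据']
print(one_example.label) # 1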
構(gòu)建Example列表
- Example列表的構(gòu)建需要數(shù)據(jù)源的數(shù)據(jù) 端朵。
- 可以使用[..., ..., ...]構(gòu)建好芭,下面數(shù)據(jù)多,我們使用循環(huán)構(gòu)建冲呢,
- 數(shù)據(jù)是csv格式栓撞,csv得分隔符可以體現(xiàn)在擴展名上。
- csv: Comma-Separated Values碗硬,
- tsv: Tab-Separated Values
import pandas as pd
from torchtext.data import Field
from torchtext.data import Example
# ----------------------------------------------------------------------
# 2. 數(shù)據(jù)集需要的exampls列表(list(Example)):
# 使用pandas讀取csv文件瓤湘,其他方式也可以。比如csv庫恩尾。
data = pd.read_csv("datasets/train.tsv", sep='\t') # csv: Comma-Separated Values弛说,tsv: Tab-Separated Values
examples = []
for txt, lab in zip(data["text"], data["label"]):
one_example = Example.fromlist([txt, lab], fields)
examples.append(one_example)
examples[0:5] # 顯示5個
[<torchtext.data.example.Example at 0x2a9d233c4a8>,
<torchtext.data.example.Example at 0x2a9b5fee9e8>,
<torchtext.data.example.Example at 0x2a9d8fae9e8>,
<torchtext.data.example.Example at 0x2a9d8faea90>,
<torchtext.data.example.Example at 0x2a9d8fae9b0>]
構(gòu)建數(shù)據(jù)集
- 使用Dataset構(gòu)造器構(gòu)建數(shù)據(jù)集
Dataset(examples, fields, filter_pred=None)
from torchtext.data import Dataset
# 這個數(shù)據(jù)集與torch.utils.data的Dataset是有差異的。 torch.utils.data的DataLoader要求數(shù)據(jù)是整齊的翰意,就是每個記錄長度一樣木人。
dataset = Dataset(examples, fields)
dataset
<torchtext.data.dataset.Dataset at 0x2a9b7f1a780>
A closer look at the dataset
- Dataset should offer functions for operating on the data; the help documentation below shows them.
- In particular it provides element access:
__getitem__(self, i)
__len__(self)
__iter__(self)
- split(self, split_ratio=0.7, stratified=False, strata_field='label', random_state=None)
  - splits the dataset into training set + test set, as the sketch after this list shows;
- filter_examples(self, field_names)
  - removes unknown words from the examples in the given fields.
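A minimal sketch of split (the 0.8 ratio is illustrative; the splits come back as a tuple):
train_set, valid_set = dataset.split(split_ratio=0.8)
print(len(train_set), len(valid_set)) # roughly 80% / 20% of the examples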
help(dataset)
Help on Dataset in module torchtext.data.dataset object:
class Dataset(torch.utils.data.dataset.Dataset)
| Defines a dataset composed of Examples along with its Fields.
|
| Attributes:
| sort_key (callable): A key to use for sorting dataset examples for batching
| together examples with similar lengths to minimize padding.
| examples (list(Example)): The examples in this dataset.
| fields (dict[str, Field]): Contains the name of each column or field, together
| with the corresponding Field object. Two fields with the same Field object
| will have a shared vocabulary.
|
| Method resolution order:
| Dataset
| torch.utils.data.dataset.Dataset
| builtins.object
|
| Methods defined here:
|
| __getattr__(self, attr)
|
| __getitem__(self, i)
|
| __init__(self, examples, fields, filter_pred=None)
| Create a dataset from a list of Examples and Fields.
|
| Arguments:
| examples: List of Examples.
| fields (List(tuple(str, Field))): The Fields to use in this tuple. The
| string is a field name, and the Field is the associated field.
| filter_pred (callable or None): Use only examples for which
| filter_pred(example) is True, or use all examples if None.
| Default is None.
|
| __iter__(self)
|
| __len__(self)
|
| filter_examples(self, field_names)
| Remove unknown words from dataset examples with respect to given field.
|
| Arguments:
| field_names (list(str)): Within example only the parts with field names in
| field_names will have their unknown words deleted.
|
| split(self, split_ratio=0.7, stratified=False, strata_field='label', random_state=None)
| Create train-test(-valid?) splits from the instance's examples.
|
| Arguments:
| split_ratio (float or List of floats): a number [0, 1] denoting the amount
| of data to be used for the training split (rest is used for test),
| or a list of numbers denoting the relative sizes of train, test and valid
| splits respectively. If the relative size for valid is missing, only the
| train-test split is returned. Default is 0.7 (for the train set).
| stratified (bool): whether the sampling should be stratified.
| Default is False.
| strata_field (str): name of the examples Field stratified over.
| Default is 'label' for the conventional label field.
| random_state (tuple): the random seed used for shuffling.
| A return value of `random.getstate()`.
|
| Returns:
| Tuple[Dataset]: Datasets for train, validation, and
| test splits in that order, if the splits are provided.
|
| ----------------------------------------------------------------------
| Class methods defined here:
|
| download(root, check=None) from builtins.type
| Download and unzip an online archive (.zip, .gz, or .tgz).
|
| Arguments:
| root (str): Folder to download data to.
| check (str or None): Folder whose existence indicates
| that the dataset has already been downloaded, or
| None to check the existence of root/{cls.name}.
|
| Returns:
| str: Path to extracted dataset.
|
| splits(path=None, root='.data', train=None, validation=None, test=None, **kwargs) from builtins.type
| Create Dataset objects for multiple splits of a dataset.
|
| Arguments:
| path (str): Common prefix of the splits' file paths, or None to use
| the result of cls.download(root).
| root (str): Root dataset storage directory. Default is '.data'.
| train (str): Suffix to add to path for the train set, or None for no
| train set. Default is None.
| validation (str): Suffix to add to path for the validation set, or None
| for no validation set. Default is None.
| test (str): Suffix to add to path for the test set, or None for no test
| set. Default is None.
| Remaining keyword arguments: Passed to the constructor of the
| Dataset (sub)class being used.
|
| Returns:
| Tuple[Dataset]: Datasets for train, validation, and
| test splits in that order, if provided.
|
| ----------------------------------------------------------------------
| Data and other attributes defined here:
|
| sort_key = None
|
| ----------------------------------------------------------------------
| Methods inherited from torch.utils.data.dataset.Dataset:
|
| __add__(self, other)
|
| ----------------------------------------------------------------------
| Data descriptors inherited from torch.utils.data.dataset.Dataset:
|
| __dict__
| dictionary for instance variables (if defined)
|
| __weakref__
| list of weak references to the object (if defined)
- Dataset traversal, method 1
# Dataset access and traversal
for i in range(5): # len(dataset)
    print(dataset[i])
<torchtext.data.example.Example object at 0x000002A9D233C4A8>
<torchtext.data.example.Example object at 0x000002A9B5FEE9E8>
<torchtext.data.example.Example object at 0x000002A9D8FAE9E8>
<torchtext.data.example.Example object at 0x000002A9D8FAEA90>
<torchtext.data.example.Example object at 0x000002A9D8FAE9B0>
- Dataset traversal, method 2
# Dataset access and traversal
for one_ex in dataset:
    print(one_ex)
    break
<torchtext.data.example.Example object at 0x000002A9D233C4A8>
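- The Example repr above is opaque; the actual values live on the field attributes. A quick peek at the first sample (a sketch; the printed tokens depend on the data):
print(dataset[0].text[:10]) # the first 10 tokens of the tokenized text
print(dataset[0].label)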
構(gòu)建批次數(shù)據(jù)
- 數(shù)據(jù)集的數(shù)據(jù)使用迭代器來完成訪問。從上面例子應(yīng)該知道进鸠,從數(shù)據(jù)集無法訪問到具體的數(shù)據(jù)值稠曼,沒有提供訪問的標(biāo)準(zhǔn)接口。
Iterator的幫助文檔
- 使用Iterator類也是兩種方式:
- 構(gòu)造器:
__init__(self, dataset, batch_size, sort_key=None, device=None, batch_size_fn=None, train=True, repeat=False, shuffle=None, sort=None, sort_within_batch=None)
- 使用類函數(shù)
splits(datasets, batch_sizes=None, **kwargs)
- 構(gòu)造器:
- Iterator提供了數(shù)據(jù)遍歷方式客年,只是遍歷的是批次霞幅。
__iter__(self)
__len__(self)
- 直接返回數(shù)據(jù):
data(self)
from torchtext.data import Iterator
help(Iterator)
Help on class Iterator in module torchtext.data.iterator:
class Iterator(builtins.object)
| Defines an iterator that loads batches of data from a Dataset.
|
| Attributes:
| dataset: The Dataset object to load Examples from.
| batch_size: Batch size.
| batch_size_fn: Function of three arguments (new example to add, current
| count of examples in the batch, and current effective batch size)
| that returns the new effective batch size resulting from adding
| that example to a batch. This is useful for dynamic batching, where
| this function would add to the current effective batch size the
| number of tokens in the new example.
| sort_key: A key to use for sorting examples in order to batch together
| examples with similar lengths and minimize padding. The sort_key
| provided to the Iterator constructor overrides the sort_key
| attribute of the Dataset, or defers to it if None.
| train: Whether the iterator represents a train set.
| repeat: Whether to repeat the iterator for multiple epochs. Default: False.
| shuffle: Whether to shuffle examples between epochs.
| sort: Whether to sort examples according to self.sort_key.
| Note that shuffle and sort default to train and (not train).
| sort_within_batch: Whether to sort (in descending order according to
| self.sort_key) within each batch. If None, defaults to self.sort.
| If self.sort is True and this is False, the batch is left in the
| original (ascending) sorted order.
| device (str or `torch.device`): A string or instance of `torch.device`
| specifying which device the Variables are going to be created on.
| If left as default, the tensors will be created on cpu. Default: None.
|
| Methods defined here:
|
| __init__(self, dataset, batch_size, sort_key=None, device=None, batch_size_fn=None, train=True, repeat=False, shuffle=None, sort=None, sort_within_batch=None)
| Initialize self. See help(type(self)) for accurate signature.
|
| __iter__(self)
|
| __len__(self)
|
| create_batches(self)
|
| data(self)
| Return the examples in the dataset in order, sorted, or shuffled.
|
| init_epoch(self)
| Set up the batch generator for a new epoch.
|
| load_state_dict(self, state_dict)
|
| state_dict(self)
|
| ----------------------------------------------------------------------
| Class methods defined here:
|
| splits(datasets, batch_sizes=None, **kwargs) from builtins.type
| Create Iterator objects for multiple splits of a dataset.
|
| Arguments:
| datasets: Tuple of Dataset objects corresponding to the splits. The
| first such object should be the train set.
| batch_sizes: Tuple of batch sizes to use for the different splits,
| or None to use the same batch_size for all splits.
| Remaining keyword arguments: Passed to the constructor of the
| iterator class being used.
|
| ----------------------------------------------------------------------
| Data descriptors defined here:
|
| __dict__
| dictionary for instance variables (if defined)
|
| __weakref__
| list of weak references to the object (if defined)
|
| epoch
Building Iterator objects with the splits function
- The core parameters of splits are datasets and batch_sizes:
  - datasets: must be a tuple of datasets;
  - batch_sizes: a tuple of batch sizes matching datasets;
from torchtext.data import Iterator

print(len(dataset))
it_dataset, = Iterator.splits((dataset, ), batch_sizes=(100, ))
it_dataset, len(it_dataset)
6300
(<torchtext.data.iterator.Iterator at 0x2a9d9c0b0b8>, 63)
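- The same iterator can also be created with the constructor directly; a minimal sketch (the sort/shuffle arguments are left at their defaults):
it_dataset2 = Iterator(dataset, batch_size=100)
print(len(it_dataset2)) # 63 batches for 6300 examples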
Word vectors and building the vocabulary
The Iterator we built cannot work yet: it needs a vocabulary, through which the text is converted to numerical values (the vocabulary is built from token frequencies, i.e. TF).
- There are two ways to build the vocabulary:
  - use pre-trained word vectors, specified with the vectors parameter;
  - use the default behaviour, setting vectors=None.
Pre-trained word vectors
- Here we only care about Chinese; for English you can use spacy or sacremoses.
- Download address: https://github.com/Embedding/Chinese-Word-Vectors
[Screenshot: the pre-trained word vectors page]
- The downloaded word vector file
  - is 700+ MB, a rather hefty file.
[Screenshot: the word vector file]
- Loading the word vector file
from torchtext.vocab import Vectors
# Loading takes a while.
vectors = Vectors(name="sgns.zhihu.word", cache="datasets")
vectors
  0%|          | 0/259922 [00:00<?, ?it/s]Skipping token b'259922' with 1-dimensional vector [b'300']; likely a header
100%|██████████████████████████████████████████████████| 259922/259922 [00:30<00:00, 8568.81it/s]
<torchtext.vocab.Vectors at 0x2a9d9b11ac8>
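- A loaded Vectors object can be indexed by token; a quick peek (the chosen word is arbitrary):
v = vectors["中国"] # returns a tensor; words missing from the file map to a zero vector by default
print(v.shape)      # torch.Size([300]), matching the 300-dimensional header above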
Building the vocabulary with the word vectors
# The text field uses the pre-trained word vectors
fld_text.build_vocab(dataset, vectors=vectors) # see the word vectors above
# The label is an integer; no word vectors needed.
fld_label.build_vocab(dataset)
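- After build_vocab the field carries the vocabulary and the embedding matrix; a quick inspection (the numbers depend on the data):
print(len(fld_text.vocab))          # vocabulary size
print(fld_text.vocab.stoi['<pad>']) # index of the padding token, normally 1
print(fld_text.vocab.vectors.shape) # (vocab_size, 300)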
Using the dataset
Traversal
- Now it_dataset, the Iterator, can be used to iterate over the dataset:
__iter__(self)
__len__(self)
- Note: there is no __getitem__ function; the iterator can only be traversed.
for item in it_dataset:
    print(item)
[torchtext.data.batch.Batch of size 100]
[.text]:[torch.LongTensor of size 54x100]
[.label]:[torch.LongTensor of size 100]
[torchtext.data.batch.Batch of size 100]
[.text]:[torch.LongTensor of size 48x100]
[.label]:[torch.LongTensor of size 100]
.......
[torchtext.data.batch.Batch of size 100]
[.text]:[torch.LongTensor of size 54x100]
[.label]:[torch.LongTensor of size 100]
[torchtext.data.batch.Batch of size 100]
[.text]:[torch.LongTensor of size 53x100]
[.label]:[torch.LongTensor of size 100]
[torchtext.data.batch.Batch of size 100]
[.text]:[torch.LongTensor of size 48x100]
[.label]:[torch.LongTensor of size 100]
Fetching the data
- Fetching the text:
for item in it_dataset:
    print(item.text) # item.label
tensor([[ 284, 2568, 115, ..., 66, 62, 14],
[1041, 2, 990, ..., 848, 92, 158],
[ 445, 369, 17, ..., 19, 585, 1103],
...,
[ 1, 1, 1, ..., 1, 1, 1],
[ 1, 1, 1, ..., 1, 1, 1],
[ 1, 1, 1, ..., 1, 1, 1]])
......
tensor([[ 96, 548, 197, ..., 45, 12, 47],
[ 635, 1167, 62, ..., 1036, 1306, 10],
[9668, 14, 14, ..., 357, 1329, 36],
...,
[ 1, 1, 1, ..., 1, 1, 1],
[ 1, 1, 1, ..., 1, 1, 1],
[ 1, 1, 1, ..., 1, 1, 1]])
- 取標(biāo)簽
for item in it_dataset:
print(item.label)
tensor([0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1,
0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0,
1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0,
0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1,
1, 0, 1, 0])
......
tensor([1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1,
0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0,
1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0,
0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1,
0, 1, 0, 1])
Applying TorchText in text classification
Dataset processing
Wrapping it into a function
import pandas as pd
from torchtext.data import Field
from torchtext.data import Example
from torchtext.data import Dataset
from torchtext.data import Iterator
from torchtext.vocab import Vectors
import re
import jieba

regex = re.compile(r'[^\u4e00-\u9fa5A-Za-z0-9]')

fld_label = Field()
fld_text = Field()
# The label field is simple
fld_label.sequential = False # this attribute defaults to True
fld_label.use_vocab = False  # this attribute defaults to True
# The feature (text) field
fld_text.sequential = True   # this attribute defaults to True
fld_text.use_vocab = True    # this attribute defaults to True
fld_text.batch_first = True

# Since sequential is True, the tokenize attribute must be set
def word_cut(text):
    text = regex.sub(' ', text)
    return [word for word in jieba.cut(text) if word.strip()]

fld_text.tokenize = word_cut
# Build the fields needed by Dataset
fields = [("text", fld_text), ("label", fld_label)] # two fields

def load_data(data_file):
    data = pd.read_csv(data_file, sep='\t') # csv: Comma-Separated Values; tsv: Tab-Separated Values
    examples = []
    for txt, lab in zip(data["text"], data["label"]):
        one_example = Example.fromlist([txt, lab], fields)
        examples.append(one_example)
    dataset = Dataset(examples, fields)
    it_dataset, = Iterator.splits((dataset, ), batch_sizes=(1000, )) # too large a batch can overflow GPU memory
    vectors = Vectors(name="sgns.zhihu.word", cache="datasets")
    fld_text.build_vocab(dataset, vectors=vectors) # see the word vectors above
    # The label is an integer; no word vectors needed.
    fld_label.build_vocab(dataset)
    return it_dataset
加載訓(xùn)練集與測試集
- 數(shù)據(jù)集文件說明:
- 訓(xùn)練集:train.tsv
- 驗證集:valid.tsv
it_train = load_data("datasets/train.tsv")
it_valid = load_data("datasets/valid.tsv")
it_train, it_train
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\gaoke\AppData\Local\Temp\jieba.cache
Loading model cost 0.570 seconds.
Prefix dict has been built successfully.
(<torchtext.data.iterator.Iterator at 0x1c9406f3588>,
<torchtext.data.iterator.Iterator at 0x1c9406f3588>)
The model
- The model is simply an LSTM:
import torch
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim,
                 n_layers=2, bidirectional=True, dropout=0.2, pad_idx=0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
        self.rnn = nn.LSTM(embedding_dim, hidden_dim, num_layers=n_layers, batch_first=True, bidirectional=bidirectional)
        self.fc = nn.Linear(hidden_dim * 2, output_dim) # * 2 because the LSTM is bidirectional
        self.dropout = nn.Dropout(dropout)

    def forward(self, text):
        embedded = self.dropout(self.embedding(text))   # (batch, seq_len, embedding_dim)
        output, (hidden, cell) = self.rnn(embedded)
        # concatenate the last hidden states of the two directions
        hidden = self.dropout(torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1))
        return self.fc(hidden.squeeze(0))
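- A quick shape check with a dummy batch confirms the wiring (all sizes here are illustrative):
m = RNN(vocab_size=1000, embedding_dim=300, hidden_dim=128, output_dim=2)
dummy = torch.randint(0, 1000, (4, 20)) # (batch, seq_len), matching batch_first=True
print(m(dummy).shape)                   # torch.Size([4, 2])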
訓(xùn)練
訓(xùn)練的核心函數(shù)
- 參數(shù):
- 訓(xùn)練集
- 驗證集
- 模型
import torch.nn.functional as F
def train(train_iter, valid_iter, model):
# 訓(xùn)練超參數(shù)
EPOCHES = 10
CUDA = torch.cuda.is_available() # GPU內(nèi)存不夠
# CUDA = False
if CUDA:
model.cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
for epoch in range(1, EPOCHES):
for batch in train_iter: # 訓(xùn)練集
feature, target = batch.text, batch.label
if CUDA:
feature, target = feature.cuda(), target.cuda()
optimizer.zero_grad()
logits = model(feature)
loss = F.cross_entropy(logits, target)
loss.backward()
optimizer.step()
# 測試預(yù)測準(zhǔn)確率
corrects = 0.0
with torch.no_grad():
# sample_num樣本數(shù)量
sample_num = 0
for item in valid_iter:
feature, target = item.text, item.label
if CUDA:
feature, target = feature.cuda(), target.cuda()
logits = model(feature)
corrects += (torch.max(logits, 1)[1].view(target.size()).data == target.data).sum()
sample_num += len(feature)
print(F"輪數(shù):{epoch:03d},\t準(zhǔn)確率:{corrects/sample_num}")
準(zhǔn)備訓(xùn)練的條件
- 條件包含:
- 構(gòu)建網(wǎng)絡(luò)需要的參數(shù)
- 需要詞向量化過程中詞表等變量
- 數(shù)據(jù)集(已經(jīng)準(zhǔn)備好)
- 構(gòu)建網(wǎng)絡(luò)需要的參數(shù)
# 參數(shù)
vocabulary_size = len(fld_text.vocab)
embedding_dim = fld_text.vocab.vectors.size()[-1]
class_num = len(fld_label.vocab)
hidden_dim = 128
print(vocabulary_size, embedding_dim, hidden_dim, class_num)
# 構(gòu)建網(wǎng)絡(luò)模型
net = RNN(vocabulary_size, embedding_dim, hidden_dim, class_num)
11361 300 128 4
訓(xùn)練并驗證
print("開始訓(xùn)練....")
train(it_train, it_valid, net)
# 保存模型
torch.save(net.state_dict(), "rnn.model")
開始訓(xùn)練....
輪數(shù):001, 準(zhǔn)確率:0.9114285707473755
輪數(shù):002, 準(zhǔn)確率:0.9372857213020325
輪數(shù):003, 準(zhǔn)確率:0.9451428651809692
輪數(shù):004, 準(zhǔn)確率:0.9494285583496094
輪數(shù):005, 準(zhǔn)確率:0.9472857117652893
輪數(shù):006, 準(zhǔn)確率:0.9490000009536743
輪數(shù):007, 準(zhǔn)確率:0.951714277267456
輪數(shù):008, 準(zhǔn)確率:0.953000009059906
輪數(shù):009, 準(zhǔn)確率:0.9485714435577393
Appendix:
- The prediction code is quite simple; a possible sketch is given below.