TORCH04-01TorchText之文本數(shù)據(jù)集處理

??原來一致湊合著使用Torch中torch.util.data下的數(shù)據(jù)集工具做數(shù)據(jù)處理，但是其中的DataLoader要求樣本的長度是對齊的毛好，而且對不同的數(shù)據(jù)源需要做細(xì)節(jié)處理孽水。
??Torch提供了torchtext.data模塊用來實現(xiàn)文本的處理陕贮，并并結(jié)合中文分詞工具，基本上可以滿足日常的文本處理了甜滨。
??這個主題就是介紹torchtext并入門乐严，主要介紹Field，Example衣摩，Dataset昂验，Vectors的使用，并使用LSTM網(wǎng)絡(luò)做了一個文本分類的例子。實際torchtext還是很彪悍的工具模塊凛篙。

tortext 模塊結(jié)構(gòu)

torchtext模塊包含文本數(shù)據(jù)處理與文本數(shù)據(jù)集
1. 文本數(shù)據(jù)處理
  1. torchtext.data
  2. torchtext.data.utils
  3. torchtext.data.functional
  4. torchtext.data.metrics
  5. torchtext.vocab
  6. torchtext.utils
2. 文本數(shù)據(jù)集
  1. torchtext.datasets
  2. torchtext.experimental.datasets
  3. examples
注意：
- 這里先從torchtext.data開始使用TorchText黍匾。

torchtext.data結(jié)構(gòu)

torchtext.data
1. Dataset, Batch, and Example
2. Fields
3. Iterators
4. Pipeline
5. Functions
文本處理的核心模式：
1. Dataset指定文本數(shù)據(jù)源；
2. Field指定處理字段呛梆；
3. Iterator遍歷數(shù)據(jù)集锐涯；
下面是TorchText的使用模式示意圖
- 參考鏈接：http://mlexplained.com/2018/02/08/a-comprehensive-tutorial-to-torchtext/

TorchText使用模式示意圖

TorchText使用例子

下面我們從一個例子來說明TorchText的使用模式。
1. 環(huán)境安裝
2. 數(shù)據(jù)源
3. 定義字段Field
4. 構(gòu)建數(shù)據(jù)集
5. 構(gòu)建批次數(shù)據(jù)
6. 詞向量與構(gòu)建詞表
7. 使用數(shù)據(jù)集

環(huán)境安裝

安裝torchtext
- pip install torchtext

注意：
- 因為bug的緣故填物，建議采用直接在github安裝修正版：
  - pip install https://github.com/pytorch/text/archive/master.zip

安裝torchtext

可選安裝1 - 分詞工具
- pip install spacy
- python -m spacy download en

spacy官網(wǎng)：
- https://spacy.io/models/

Spacy分詞工具

注意：
- 安裝訓(xùn)練庫會因為網(wǎng)絡(luò)緣故纹腌，無法下載，可以使用如下安裝：
  - pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz
  - 也可以直接下載滞磺，并安裝（這兒采用的升薯。）。

安裝庫

可選安裝2 - 分詞工具
- pip install sacremoses

安裝sacremoses

安裝 -分詞工具
- 結(jié)巴分詞
- pip install jieba

數(shù)據(jù)源

下載地址
- https://github.com/bigboNed3/chinese_text_cnn
下載的文件：
- 訓(xùn)練集：train.tsv
- 測試集：test.tsv
- 驗證集：dev.tsv

數(shù)據(jù)源文件

注意：
- 文件也可以使用其他方式存儲击困，比如text文件涎劈，json文件等。
數(shù)據(jù)格式：
- 序號（多余的字段）
- label
- text

數(shù)據(jù)格式

定義字段Field

Field類幫助文檔

構(gòu)造Field對象的參數(shù)設(shè)置有兩種方式：
1. 在構(gòu)造器中設(shè)置
2. 使用屬性設(shè)置（我們后面采用屬性設(shè)置）

from torchtext.data import Field

Field?

?[1;31mInit signature:?[0m
?[0mField?[0m?[1;33m(?[0m?[1;33m
?[0m    ?[0msequential?[0m?[1;33m=?[0m?[1;32mTrue?[0m?[1;33m,?[0m?[1;33m
?[0m    ?[0muse_vocab?[0m?[1;33m=?[0m?[1;32mTrue?[0m?[1;33m,?[0m?[1;33m
?[0m    ?[0minit_token?[0m?[1;33m=?[0m?[1;32mNone?[0m?[1;33m,?[0m?[1;33m
?[0m    ?[0meos_token?[0m?[1;33m=?[0m?[1;32mNone?[0m?[1;33m,?[0m?[1;33m
?[0m    ?[0mfix_length?[0m?[1;33m=?[0m?[1;32mNone?[0m?[1;33m,?[0m?[1;33m
?[0m    ?[0mdtype?[0m?[1;33m=?[0m?[0mtorch?[0m?[1;33m.?[0m?[0mint64?[0m?[1;33m,?[0m?[1;33m
?[0m    ?[0mpreprocessing?[0m?[1;33m=?[0m?[1;32mNone?[0m?[1;33m,?[0m?[1;33m
?[0m    ?[0mpostprocessing?[0m?[1;33m=?[0m?[1;32mNone?[0m?[1;33m,?[0m?[1;33m
?[0m    ?[0mlower?[0m?[1;33m=?[0m?[1;32mFalse?[0m?[1;33m,?[0m?[1;33m
?[0m    ?[0mtokenize?[0m?[1;33m=?[0m?[1;32mNone?[0m?[1;33m,?[0m?[1;33m
?[0m    ?[0mtokenizer_language?[0m?[1;33m=?[0m?[1;34m'en'?[0m?[1;33m,?[0m?[1;33m
?[0m    ?[0minclude_lengths?[0m?[1;33m=?[0m?[1;32mFalse?[0m?[1;33m,?[0m?[1;33m
?[0m    ?[0mbatch_first?[0m?[1;33m=?[0m?[1;32mFalse?[0m?[1;33m,?[0m?[1;33m
?[0m    ?[0mpad_token?[0m?[1;33m=?[0m?[1;34m'<pad>'?[0m?[1;33m,?[0m?[1;33m
?[0m    ?[0munk_token?[0m?[1;33m=?[0m?[1;34m'<unk>'?[0m?[1;33m,?[0m?[1;33m
?[0m    ?[0mpad_first?[0m?[1;33m=?[0m?[1;32mFalse?[0m?[1;33m,?[0m?[1;33m
?[0m    ?[0mtruncate_first?[0m?[1;33m=?[0m?[1;32mFalse?[0m?[1;33m,?[0m?[1;33m
?[0m    ?[0mstop_words?[0m?[1;33m=?[0m?[1;32mNone?[0m?[1;33m,?[0m?[1;33m
?[0m    ?[0mis_target?[0m?[1;33m=?[0m?[1;32mFalse?[0m?[1;33m,?[0m?[1;33m
?[0m?[1;33m)?[0m?[1;33m?[0m?[0m
?[1;31mDocstring:?[0m     
Defines a datatype together with instructions for converting to Tensor.

Field class models common text processing datatypes that can be represented
by tensors.  It holds a Vocab object that defines the set of possible values
for elements of the field and their corresponding numerical representations.
The Field object also holds other parameters relating to how a datatype
should be numericalized, such as a tokenization method and the kind of
Tensor that should be produced.

If a Field is shared between two columns in a dataset (e.g., question and
answer in a QA dataset), then they will have a shared vocabulary.

Attributes:
    sequential: Whether the datatype represents sequential data. If False,
        no tokenization is applied. Default: True.
    use_vocab: Whether to use a Vocab object. If False, the data in this
        field should already be numerical. Default: True.
    init_token: A token that will be prepended to every example using this
        field, or None for no initial token. Default: None.
    eos_token: A token that will be appended to every example using this
        field, or None for no end-of-sentence token. Default: None.
    fix_length: A fixed length that all examples using this field will be
        padded to, or None for flexible sequence lengths. Default: None.
    dtype: The torch.dtype class that represents a batch of examples
        of this kind of data. Default: torch.long.
    preprocessing: The Pipeline that will be applied to examples
        using this field after tokenizing but before numericalizing. Many
        Datasets replace this attribute with a custom preprocessor.
        Default: None.
    postprocessing: A Pipeline that will be applied to examples using
        this field after numericalizing but before the numbers are turned
        into a Tensor. The pipeline function takes the batch as a list, and
        the field's Vocab.
        Default: None.
    lower: Whether to lowercase the text in this field. Default: False.
    tokenize: The function used to tokenize strings using this field into
        sequential examples. If "spacy", the SpaCy tokenizer is
        used. If a non-serializable function is passed as an argument,
        the field will not be able to be serialized. Default: string.split.
    tokenizer_language: The language of the tokenizer to be constructed.
        Various languages currently supported only in SpaCy.
    include_lengths: Whether to return a tuple of a padded minibatch and
        a list containing the lengths of each examples, or just a padded
        minibatch. Default: False.
    batch_first: Whether to produce tensors with the batch dimension first.
        Default: False.
    pad_token: The string token used as padding. Default: "<pad>".
    unk_token: The string token used to represent OOV words. Default: "<unk>".
    pad_first: Do the padding of the sequence at the beginning. Default: False.
    truncate_first: Do the truncating of the sequence at the beginning. Default: False
    stop_words: Tokens to discard during the preprocessing step. Default: None
    is_target: Whether this field is a target variable.
        Affects iteration over batches. Default: False
?[1;31mFile:?[0m           c:\program files\python36\lib\site-packages\torchtext\data\field.py
?[1;31mType:?[0m           type
?[1;31mSubclasses:?[0m     ReversibleField, NestedField, LabelField, ShiftReduceField, ParsedTextField, BABI20Field

Field類說明

CLASS torchtext.data.Field(
    sequential=True,         # 是否序列數(shù)據(jù)阅茶， 默認(rèn)值True蛛枚，如果為False，則不需要init_token參數(shù)脸哀。
    use_vocab=True,          # 是否使用詞袋對象蹦浦，默認(rèn)True，如果指定False撞蜂，則不需要處理這個字典盲镶，表示這個字典默認(rèn)是Numerical。 
    init_token=None,         # 加載每個字段前的處理函數(shù)蝌诡。
    eos_token=None,          # 加載完每個字段后的處理函數(shù)溉贿。
    fix_length=None,         # 指定字段的文本長度。
    dtype=torch.int64,       # 數(shù)據(jù)類型送漠。
    preprocessing=None,      # 在token后顽照，轉(zhuǎn)換為Numerical之前的處理管道由蘑。
    postprocessing=None,     # 轉(zhuǎn)換為Numerical之后的處理管道闽寡。
    lower=False,             # 是否小寫轉(zhuǎn)換。
    tokenize=None,           # 用來把文本轉(zhuǎn)成序列文本單詞的函數(shù)尼酿，缺省的使用string.split函數(shù)爷狈，如果指定Spacy，就使用Spacy分詞工具裳擎。
    tokenizer_language='en', # 指定文本語言涎永，在指定除en意外語言，tokenize必須使用Spacy。
    include_lengths=False,   # 是否只返回補丁長度羡微，還是返回不定長度與數(shù)據(jù)長度谷饿。返回兩個長度使用元組類型。
    batch_first=False,       # 是否把批次大小放第一個維度（這是因為LSTM等網(wǎng)絡(luò)模塊對格式的要求）妈倔。
    pad_token='<pad>',       # 用來做補丁對齊處理的token函數(shù)博投。
    unk_token='<unk>',       # 出現(xiàn)OOV的處理函數(shù)。OOV（Out-of-vocabulary）就是出現(xiàn)不在詞袋內(nèi)的單詞的處理函數(shù)盯蝴。
    pad_first=False,         # 補丁對齊的兩種情況毅哗，在前補丁（True）捧挺，在后補堵敲唷（False）
    truncate_first=False,    # 文本超過長度的截斷方式：丟棄前面（True）與后面（False）
    stop_words=None,         # 預(yù)處理步驟中需要丟棄的單詞（停用詞）。
    is_target=False)         # 是否是標(biāo)簽字段闽烙。

構(gòu)建Feild字段的例子

根據(jù)上面的數(shù)據(jù)源來構(gòu)建三個字段：
1. 索引（無字段名）
2. 標(biāo)簽（label）
3. 特征（text）

構(gòu)建默認(rèn)對象
- 因為索引不是我們需要的數(shù)據(jù)列翅睛，所以該字段不用處理。

from torchtext.data import Field
fld_label = Field()
fld_text = Field()

設(shè)置基本屬性

# 標(biāo)簽字段比較簡答
fld_label.sequential = False     # 這個屬性默認(rèn)True
fld_label.use_vocab = False      # 這個屬性默認(rèn)True

# 特征字段
fld_text .sequential = True     # 這個屬性默認(rèn)True
fld_text .use_vocab = True      # 這個屬性默認(rèn)True
# 因為sequential為True黑竞，則必須指定分詞屬性token

設(shè)置token屬性宏所，指定分詞函數(shù)
- 該函數(shù)的要求：
  1. 參數(shù)：傳入一個樣本的特征（就是text字段）
  2. 返回：返回一個列表，就是分詞以后的結(jié)果摊溶，這樣字段的數(shù)據(jù)就不是字符串爬骤，而是單詞列表。

import re
import jieba
regex = re.compile(r'[^\u4e00-\u9fa5aA-Za-z0-9]')
def word_cut(text):
    text = regex.sub(' ', text)
    return [word for word in jieba.cut(text) if word.strip()]

fld_text .tokenize = word_cut

構(gòu)建數(shù)據(jù)集

Dataset的幫助文檔

from torchtext.data import Dataset
Dataset?

?[1;31mInit signature:?[0m ?[0mDataset?[0m?[1;33m(?[0m?[0mexamples?[0m?[1;33m,?[0m ?[0mfields?[0m?[1;33m,?[0m ?[0mfilter_pred?[0m?[1;33m=?[0m?[1;32mNone?[0m?[1;33m)?[0m?[1;33m?[0m?[0m
?[1;31mDocstring:?[0m     
Defines a dataset composed of Examples along with its Fields.

Attributes:
    sort_key (callable): A key to use for sorting dataset examples for batching
        together examples with similar lengths to minimize padding.
    examples (list(Example)): The examples in this dataset.
    fields (dict[str, Field]): Contains the name of each column or field, together
        with the corresponding Field object. Two fields with the same Field object
        will have a shared vocabulary.
?[1;31mInit docstring:?[0m
Create a dataset from a list of Examples and Fields.

Arguments:
    examples: List of Examples.
    fields (List(tuple(str, Field))): The Fields to use in this tuple. The
        string is a field name, and the Field is the associated field.
    filter_pred (callable or None): Use only examples for which
        filter_pred(example) is True, or use all examples if None.
        Default is None.
?[1;31mFile:?[0m           c:\program files\python36\lib\site-packages\torchtext\data\dataset.py
?[1;31mType:?[0m           type
?[1;31mSubclasses:?[0m     TabularDataset, LanguageModelingDataset, SST, TranslationDataset, SequenceTaggingDataset, TREC, IMDB, BABI20

Dataset屬性說明

構(gòu)造器說明：

    Dataset(examples, fields, filter_pred=None)
          # examples：數(shù)據(jù)列表莫换，類型是Examples列表霞玄。
          # fields：字段列表，類型是tuple(str, Field))列表拉岁。
          # filter_pred：過濾數(shù)據(jù)集的條件坷剧，類型是可調(diào)用對象或者函數(shù)，樣本是否使用喊暖，根據(jù)函數(shù)的返回值確定惫企。True就使用。若為None陵叽，樣本全部使用狞尔。

屬性說明：
1. sort_key ：類型是callable：
2. examples：類型list(Example)
3. fields：類型dict[str, Field]

構(gòu)建數(shù)據(jù)集的字段

Dataset需要的字段是列表類型：fields (List(tuple(str, Field)))
下面例子是完整的字段的構(gòu)建例子

from torchtext.data import Field
import re
regex = re.compile(r'[^\u4e00-\u9fa5aA-Za-z0-9]')
# ----------------------------------------------------------------------
# 1. 數(shù)據(jù)集需要的Fields定義:fields (List(tuple(str, Field)))

fld_label = Field()
fld_text = Field()
# 標(biāo)簽字段比較簡答
fld_label.sequential = False     # 這個屬性默認(rèn)True
fld_label.use_vocab = False      # 這個屬性默認(rèn)True

# 特征字段
fld_text .sequential = True     # 這個屬性默認(rèn)True
fld_text .use_vocab = True      # 這個屬性默認(rèn)True

# 因為sequential為True，則必須指定分詞屬性token
def word_cut(text):
    text = regex.sub(' ', text)
    return [word for word in jieba.cut(text) if word.strip()]
fld_text .tokenize = word_cut

# 構(gòu)建Dataset需要的fields
fields = [("text", fld_text),("label", fld_label)]   # 兩個字段

fields

[('text', <torchtext.data.field.Field at 0x2a9d2336a20>),
 ('label', <torchtext.data.field.Field at 0x2a9d2336780>)]

Example幫助文檔

from torchtext.data import Example
help(Example)

Help on class Example in module torchtext.data.example:

class Example(builtins.object)
 |  Defines a single training or test example.
 |  
 |  Stores each column of the example as an attribute.
 |  
 |  Class methods defined here:
 |  
 |  fromCSV(data, fields, field_to_index=None) from builtins.type
 |  
 |  fromJSON(data, fields) from builtins.type
 |  
 |  fromdict(data, fields) from builtins.type
 |  
 |  fromlist(data, fields) from builtins.type
 |  
 |  fromtree(data, fields, subtrees=False) from builtins.type
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)

構(gòu)建Example對象

Example提供一組類函數(shù)實現(xiàn)Example對象構(gòu)建巩掺，所謂的工廠模式就是這個了偏序。
- 參數(shù)需要數(shù)據(jù)與字段描述。
- 數(shù)據(jù)與字段的長度應(yīng)該是對應(yīng)的胖替。

from torchtext.data import Field
from torchtext.data import Example

# 這里使用上面構(gòu)建的fields研儒，上面的fields是否正確豫缨，這個就可以檢測
one_example = Example.fromlist(["我是數(shù)據(jù)，很長的數(shù)據(jù)", 1], fields)     # 1是標(biāo)簽
one_example

<torchtext.data.example.Example at 0x2a9d22e2ba8>

構(gòu)建Example列表

Example列表的構(gòu)建需要數(shù)據(jù)源的數(shù)據(jù) 端朵。
- 可以使用[..., ..., ...]構(gòu)建好芭，下面數(shù)據(jù)多，我們使用循環(huán)構(gòu)建冲呢，
- 數(shù)據(jù)是csv格式栓撞，csv得分隔符可以體現(xiàn)在擴展名上。
  - csv: Comma-Separated Values碗硬，
  - tsv: Tab-Separated Values

import pandas as pd
from torchtext.data import Field
from torchtext.data import Example
# ----------------------------------------------------------------------
# 2. 數(shù)據(jù)集需要的exampls列表(list(Example)):
# 使用pandas讀取csv文件瓤湘，其他方式也可以。比如csv庫恩尾。
data = pd.read_csv("datasets/train.tsv", sep='\t')   # csv: Comma-Separated Values弛说，tsv: Tab-Separated Values

examples = []
for txt, lab in zip(data["text"], data["label"]):
    one_example = Example.fromlist([txt, lab], fields)
    examples.append(one_example)
examples[0:5]    # 顯示5個

[<torchtext.data.example.Example at 0x2a9d233c4a8>,
 <torchtext.data.example.Example at 0x2a9b5fee9e8>,
 <torchtext.data.example.Example at 0x2a9d8fae9e8>,
 <torchtext.data.example.Example at 0x2a9d8faea90>,
 <torchtext.data.example.Example at 0x2a9d8fae9b0>]

構(gòu)建數(shù)據(jù)集

使用Dataset構(gòu)造器構(gòu)建數(shù)據(jù)集
- Dataset(examples, fields, filter_pred=None)

from torchtext.data import Dataset

# 這個數(shù)據(jù)集與torch.utils.data的Dataset是有差異的。 torch.utils.data的DataLoader要求數(shù)據(jù)是整齊的翰意，就是每個記錄長度一樣木人。
dataset = Dataset(examples, fields)
dataset

<torchtext.data.dataset.Dataset at 0x2a9b7f1a780>

深入理解數(shù)據(jù)集

Dataset應(yīng)該提供函數(shù)操作數(shù)據(jù)。下面通過幫助文檔了解冀偶。
- 尤其提供數(shù)據(jù)的變量訪問：
  1. __getitem__(self, i)
  2. __len__(self)
  3. __iter__(self)
  4. split(self, split_ratio=0.7, stratified=False, strata_field='label', random_state=None)
    - 數(shù)據(jù)集切分：訓(xùn)練集 + 測試機
  5. filter_examples(self, field_names)
    - 根據(jù)字段過濾數(shù)據(jù)字段醒第。

help(dataset)

Help on Dataset in module torchtext.data.dataset object:

class Dataset(torch.utils.data.dataset.Dataset)
 |  Defines a dataset composed of Examples along with its Fields.
 |  
 |  Attributes:
 |      sort_key (callable): A key to use for sorting dataset examples for batching
 |          together examples with similar lengths to minimize padding.
 |      examples (list(Example)): The examples in this dataset.
 |      fields (dict[str, Field]): Contains the name of each column or field, together
 |          with the corresponding Field object. Two fields with the same Field object
 |          will have a shared vocabulary.
 |  
 |  Method resolution order:
 |      Dataset
 |      torch.utils.data.dataset.Dataset
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __getattr__(self, attr)
 |  
 |  __getitem__(self, i)
 |  
 |  __init__(self, examples, fields, filter_pred=None)
 |      Create a dataset from a list of Examples and Fields.
 |      
 |      Arguments:
 |          examples: List of Examples.
 |          fields (List(tuple(str, Field))): The Fields to use in this tuple. The
 |              string is a field name, and the Field is the associated field.
 |          filter_pred (callable or None): Use only examples for which
 |              filter_pred(example) is True, or use all examples if None.
 |              Default is None.
 |  
 |  __iter__(self)
 |  
 |  __len__(self)
 |  
 |  filter_examples(self, field_names)
 |      Remove unknown words from dataset examples with respect to given field.
 |      
 |      Arguments:
 |          field_names (list(str)): Within example only the parts with field names in
 |              field_names will have their unknown words deleted.
 |  
 |  split(self, split_ratio=0.7, stratified=False, strata_field='label', random_state=None)
 |      Create train-test(-valid?) splits from the instance's examples.
 |      
 |      Arguments:
 |          split_ratio (float or List of floats): a number [0, 1] denoting the amount
 |              of data to be used for the training split (rest is used for test),
 |              or a list of numbers denoting the relative sizes of train, test and valid
 |              splits respectively. If the relative size for valid is missing, only the
 |              train-test split is returned. Default is 0.7 (for the train set).
 |          stratified (bool): whether the sampling should be stratified.
 |              Default is False.
 |          strata_field (str): name of the examples Field stratified over.
 |              Default is 'label' for the conventional label field.
 |          random_state (tuple): the random seed used for shuffling.
 |              A return value of `random.getstate()`.
 |      
 |      Returns:
 |          Tuple[Dataset]: Datasets for train, validation, and
 |          test splits in that order, if the splits are provided.
 |  
 |  ----------------------------------------------------------------------
 |  Class methods defined here:
 |  
 |  download(root, check=None) from builtins.type
 |      Download and unzip an online archive (.zip, .gz, or .tgz).
 |      
 |      Arguments:
 |          root (str): Folder to download data to.
 |          check (str or None): Folder whose existence indicates
 |              that the dataset has already been downloaded, or
 |              None to check the existence of root/{cls.name}.
 |      
 |      Returns:
 |          str: Path to extracted dataset.
 |  
 |  splits(path=None, root='.data', train=None, validation=None, test=None, **kwargs) from builtins.type
 |      Create Dataset objects for multiple splits of a dataset.
 |      
 |      Arguments:
 |          path (str): Common prefix of the splits' file paths, or None to use
 |              the result of cls.download(root).
 |          root (str): Root dataset storage directory. Default is '.data'.
 |          train (str): Suffix to add to path for the train set, or None for no
 |              train set. Default is None.
 |          validation (str): Suffix to add to path for the validation set, or None
 |              for no validation set. Default is None.
 |          test (str): Suffix to add to path for the test set, or None for no test
 |              set. Default is None.
 |          Remaining keyword arguments: Passed to the constructor of the
 |              Dataset (sub)class being used.
 |      
 |      Returns:
 |          Tuple[Dataset]: Datasets for train, validation, and
 |          test splits in that order, if provided.
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |  
 |  sort_key = None
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from torch.utils.data.dataset.Dataset:
 |  
 |  __add__(self, other)
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from torch.utils.data.dataset.Dataset:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)

數(shù)據(jù)集遍歷方式1

# 數(shù)據(jù)集訪問與遍歷
for i in range(5):  #len(dataset)
    print(dataset[i])

<torchtext.data.example.Example object at 0x000002A9D233C4A8>
<torchtext.data.example.Example object at 0x000002A9B5FEE9E8>
<torchtext.data.example.Example object at 0x000002A9D8FAE9E8>
<torchtext.data.example.Example object at 0x000002A9D8FAEA90>
<torchtext.data.example.Example object at 0x000002A9D8FAE9B0>

數(shù)據(jù)集遍歷方式2

# 數(shù)據(jù)集訪問與遍歷
for one_ex in dataset: 
    print(one_ex)
    break

<torchtext.data.example.Example object at 0x000002A9D233C4A8>

構(gòu)建批次數(shù)據(jù)

數(shù)據(jù)集的數(shù)據(jù)使用迭代器來完成訪問。從上面例子應(yīng)該知道进鸠，從數(shù)據(jù)集無法訪問到具體的數(shù)據(jù)值稠曼，沒有提供訪問的標(biāo)準(zhǔn)接口。

Iterator的幫助文檔

使用Iterator類也是兩種方式：
1. 構(gòu)造器：
  - __init__(self, dataset, batch_size, sort_key=None, device=None, batch_size_fn=None, train=True, repeat=False, shuffle=None, sort=None, sort_within_batch=None)
2. 使用類函數(shù)
  - splits(datasets, batch_sizes=None, **kwargs)

Iterator提供了數(shù)據(jù)遍歷方式客年，只是遍歷的是批次霞幅。
- __iter__(self)
- __len__(self)
直接返回數(shù)據(jù)：
- data(self)

from torchtext.data import Iterator
help(Iterator)

Help on class Iterator in module torchtext.data.iterator:

class Iterator(builtins.object)
 |  Defines an iterator that loads batches of data from a Dataset.
 |  
 |  Attributes:
 |      dataset: The Dataset object to load Examples from.
 |      batch_size: Batch size.
 |      batch_size_fn: Function of three arguments (new example to add, current
 |          count of examples in the batch, and current effective batch size)
 |          that returns the new effective batch size resulting from adding
 |          that example to a batch. This is useful for dynamic batching, where
 |          this function would add to the current effective batch size the
 |          number of tokens in the new example.
 |      sort_key: A key to use for sorting examples in order to batch together
 |          examples with similar lengths and minimize padding. The sort_key
 |          provided to the Iterator constructor overrides the sort_key
 |          attribute of the Dataset, or defers to it if None.
 |      train: Whether the iterator represents a train set.
 |      repeat: Whether to repeat the iterator for multiple epochs. Default: False.
 |      shuffle: Whether to shuffle examples between epochs.
 |      sort: Whether to sort examples according to self.sort_key.
 |          Note that shuffle and sort default to train and (not train).
 |      sort_within_batch: Whether to sort (in descending order according to
 |          self.sort_key) within each batch. If None, defaults to self.sort.
 |          If self.sort is True and this is False, the batch is left in the
 |          original (ascending) sorted order.
 |      device (str or `torch.device`): A string or instance of `torch.device`
 |          specifying which device the Variables are going to be created on.
 |          If left as default, the tensors will be created on cpu. Default: None.
 |  
 |  Methods defined here:
 |  
 |  __init__(self, dataset, batch_size, sort_key=None, device=None, batch_size_fn=None, train=True, repeat=False, shuffle=None, sort=None, sort_within_batch=None)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  __iter__(self)
 |  
 |  __len__(self)
 |  
 |  create_batches(self)
 |  
 |  data(self)
 |      Return the examples in the dataset in order, sorted, or shuffled.
 |  
 |  init_epoch(self)
 |      Set up the batch generator for a new epoch.
 |  
 |  load_state_dict(self, state_dict)
 |  
 |  state_dict(self)
 |  
 |  ----------------------------------------------------------------------
 |  Class methods defined here:
 |  
 |  splits(datasets, batch_sizes=None, **kwargs) from builtins.type
 |      Create Iterator objects for multiple splits of a dataset.
 |      
 |      Arguments:
 |          datasets: Tuple of Dataset objects corresponding to the splits. The
 |              first such object should be the train set.
 |          batch_sizes: Tuple of batch sizes to use for the different splits,
 |              or None to use the same batch_size for all splits.
 |          Remaining keyword arguments: Passed to the constructor of the
 |              iterator class being used.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)
 |  
 |  epoch

使用splits函數(shù)構(gòu)建Iterator對象

splits函數(shù)的核心參數(shù)是datasets與batch_size
- datasets：需要list類型；
- batch_size：類型與datasets匹配量瓜；

from torchtext.data import Iterator
print(len(dataset))
it_dataset, = Iterator.splits((dataset, ), batch_sizes=(100, )) 
it_dataset, len(it_dataset)

6300





(<torchtext.data.iterator.Iterator at 0x2a9d9c0b0b8>, 63)

詞向量與構(gòu)建詞表

構(gòu)建的Iterator還不能直接工作司恳，因為Iterator的工作需要詞表，通過詞表才能把文本轉(zhuǎn)化為數(shù)值（原理是TF詞頻）
構(gòu)建此表兩種方式
- 使用與訓(xùn)練的詞向量绍傲，使用vectors參數(shù)指定
- 使用默認(rèn)的詞向量扔傅，設(shè)置vectors = None

預(yù)訓(xùn)練的詞向量

這里我們只關(guān)心中文，英文可以使用spacy與sacremoses
- 下載地址：https://github.com/Embedding/Chinese-Word-Vectors

預(yù)先訓(xùn)練的詞向量

下載的詞向量文件
- 700+Mb烫饼，比較刺激的文件猎塞。
  
  詞向量訓(xùn)練文件
加載詞向量文件

from torchtext.vocab import Vectors
# 會有一個加載過程。
vectors = Vectors(name="sgns.zhihu.word", cache="datasets")
vectors

  0%|    
  | 0/259922 [00:00<?, ?it/s]Skipping token b'259922' with 1-dimensional vector [b'300']; likely a header
100%|██████████████████████████████████████████████████| 259922/259922 [00:30<00:00, 8568.81it/s]

<torchtext.vocab.Vectors at 0x2a9d9b11ac8>

使用詞向量構(gòu)建詞表

# 文本使用預(yù)先訓(xùn)練的詞向量
fld_text.build_vocab(dataset, vectors = vectors)   # 見上面的詞向量

# 標(biāo)簽是整數(shù)枫弟，不用詞向量邢享。
fld_label.build_vocab(dataset)

使用數(shù)據(jù)集

遍歷

現(xiàn)在可以使用it_dataset迭代數(shù)據(jù)集了Iterator鹏往。
- __iter__(self)
- __len__(self)
- 注意：沒有__item__函數(shù)淡诗。智能迭代骇塘。

for item  in  it_dataset:
    print(item)

[torchtext.data.batch.Batch of size 100]
    [.text]:[torch.LongTensor of size 54x100]
    [.label]:[torch.LongTensor of size 100]

[torchtext.data.batch.Batch of size 100]
    [.text]:[torch.LongTensor of size 48x100]
    [.label]:[torch.LongTensor of size 100]

 .......

[torchtext.data.batch.Batch of size 100]
    [.text]:[torch.LongTensor of size 54x100]
    [.label]:[torch.LongTensor of size 100]

[torchtext.data.batch.Batch of size 100]
    [.text]:[torch.LongTensor of size 53x100]
    [.label]:[torch.LongTensor of size 100]

[torchtext.data.batch.Batch of size 100]
    [.text]:[torch.LongTensor of size 48x100]
    [.label]:[torch.LongTensor of size 100]

取數(shù)據(jù)

取文本

for item  in  it_dataset:
    print(item.text)    # item.label

tensor([[ 284, 2568,  115,  ...,   66,   62,   14],
        [1041,    2,  990,  ...,  848,   92,  158],
        [ 445,  369,   17,  ...,   19,  585, 1103],
        ...,
        [   1,    1,    1,  ...,    1,    1,    1],
        [   1,    1,    1,  ...,    1,    1,    1],
        [   1,    1,    1,  ...,    1,    1,    1]])
......
tensor([[  96,  548,  197,  ...,   45,   12,   47],
        [ 635, 1167,   62,  ..., 1036, 1306,   10],
        [9668,   14,   14,  ...,  357, 1329,   36],
        ...,
        [   1,    1,    1,  ...,    1,    1,    1],
        [   1,    1,    1,  ...,    1,    1,    1],
        [   1,    1,    1,  ...,    1,    1,    1]])

取標(biāo)簽

for item  in  it_dataset:
    print(item.label)

tensor([0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1,
        0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0,
        1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0,
        0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1,
        1, 0, 1, 0])

......
tensor([1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1,
        0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0,
        1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0,
        0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1,
        0, 1, 0, 1])

文本分類中的TorchText應(yīng)用

數(shù)據(jù)集處理

函數(shù)封裝

import pandas as pd
from torchtext.data import Field
from torchtext.data import Example
from torchtext.data import Dataset
from torchtext.data import Iterator
from torchtext.vocab import Vectors
import re
import jieba
regex = re.compile(r'[^\u4e00-\u9fa5aA-Za-z0-9]')
fld_label = Field()
fld_text = Field()
# 標(biāo)簽字段比較簡答
fld_label.sequential = False     # 這個屬性默認(rèn)True
fld_label.use_vocab = False      # 這個屬性默認(rèn)True

# 特征字段
fld_text.sequential = True     # 這個屬性默認(rèn)True
fld_text.use_vocab = True      # 這個屬性默認(rèn)True
fld_text.batch_first=True

# 因為sequential為True，則必須指定分詞屬性token
def word_cut(text):
    text = regex.sub(' ', text)
    return [word for word in jieba.cut(text) if word.strip()]
fld_text .tokenize = word_cut
# 構(gòu)建Dataset需要的fields
fields = [("text", fld_text),("label", fld_label)]   # 兩個字段
    
def load_data(data_file):

    
    data = pd.read_csv(data_file, sep='\t')   # csv: Comma-Separated Values韩容，tsv: Tab-Separated Values

    examples = []
    for txt, lab in zip(data["text"], data["label"]):
        one_example = Example.fromlist([txt, lab], fields)
        examples.append(one_example)

    dataset = Dataset(examples, fields)

    it_dataset, = Iterator.splits((dataset, ), batch_sizes=(1000, ))    # 每個批次過大款违，GPU容易溢出

    vectors = Vectors(name="sgns.zhihu.word", cache="datasets")
    fld_text.build_vocab(dataset, vectors = vectors)   # 見上面的詞向量
    # 標(biāo)簽是整數(shù)，不用詞向量群凶。
    fld_label.build_vocab(dataset)
    
    return it_dataset

加載訓(xùn)練集與測試集

數(shù)據(jù)集文件說明：
- 訓(xùn)練集：train.tsv
- 驗證集：valid.tsv

it_train = load_data("datasets/train.tsv")
it_valid = load_data("datasets/valid.tsv")
it_train, it_train

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\gaoke\AppData\Local\Temp\jieba.cache
Loading model cost 0.570 seconds.
Prefix dict has been built successfully.


(<torchtext.data.iterator.Iterator at 0x1c9406f3588>,
 <torchtext.data.iterator.Iterator at 0x1c9406f3588>)

模型

模型就使用LSTM

import torch
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim,
                 n_layers=2, bidirectional=True, dropout=0.2, pad_idx=0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
        self.rnn = nn.LSTM(embedding_dim, hidden_dim, num_layers=n_layers, batch_first=True, bidirectional=bidirectional)
        self.fc = nn.Linear(hidden_dim * 2, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, text):
        embedded = self.dropout(self.embedding(text))
        output, (hidden, cell) = self.rnn(embedded)
        hidden = self.dropout(torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1))

        return self.fc(hidden.squeeze(0))

訓(xùn)練

訓(xùn)練的核心函數(shù)

參數(shù)：
1. 訓(xùn)練集
2. 驗證集
3. 模型

import torch.nn.functional as F
def train(train_iter, valid_iter, model):
    # 訓(xùn)練超參數(shù)
    EPOCHES = 10
    CUDA = torch.cuda.is_available()   # GPU內(nèi)存不夠
    # CUDA = False
    if CUDA:
        model.cuda()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    for epoch in range(1, EPOCHES):
        for batch in train_iter:  # 訓(xùn)練集
            feature, target = batch.text, batch.label
            if CUDA:
                feature, target = feature.cuda(), target.cuda()
            optimizer.zero_grad()
            logits = model(feature)
            loss = F.cross_entropy(logits, target)
            loss.backward()
            optimizer.step()

        # 測試預(yù)測準(zhǔn)確率
        corrects = 0.0
        with torch.no_grad():
            # sample_num樣本數(shù)量
            sample_num = 0
            for item in valid_iter:
                feature, target = item.text, item.label
                if CUDA:
                    feature, target = feature.cuda(), target.cuda()
                logits = model(feature)
                corrects += (torch.max(logits, 1)[1].view(target.size()).data == target.data).sum()
                sample_num += len(feature)
            print(F"輪數(shù)：{epoch:03d},\t準(zhǔn)確率:{corrects/sample_num}")

準(zhǔn)備訓(xùn)練的條件

條件包含：
- 構(gòu)建網(wǎng)絡(luò)需要的參數(shù)
  - 需要詞向量化過程中詞表等變量
- 數(shù)據(jù)集（已經(jīng)準(zhǔn)備好）

# 參數(shù)
vocabulary_size = len(fld_text.vocab)
embedding_dim = fld_text.vocab.vectors.size()[-1]
class_num = len(fld_label.vocab)
hidden_dim = 128
print(vocabulary_size, embedding_dim, hidden_dim, class_num)
# 構(gòu)建網(wǎng)絡(luò)模型
net = RNN(vocabulary_size, embedding_dim, hidden_dim, class_num)

11361 300 128 4

訓(xùn)練并驗證

print("開始訓(xùn)練....")
train(it_train, it_valid, net)

# 保存模型
torch.save(net.state_dict(), "rnn.model")

開始訓(xùn)練....
輪數(shù)：001, 準(zhǔn)確率:0.9114285707473755
輪數(shù)：002, 準(zhǔn)確率:0.9372857213020325
輪數(shù)：003, 準(zhǔn)確率:0.9451428651809692
輪數(shù)：004, 準(zhǔn)確率:0.9494285583496094
輪數(shù)：005, 準(zhǔn)確率:0.9472857117652893
輪數(shù)：006, 準(zhǔn)確率:0.9490000009536743
輪數(shù)：007, 準(zhǔn)確率:0.951714277267456
輪數(shù)：008, 準(zhǔn)確率:0.953000009059906
輪數(shù)：009, 準(zhǔn)確率:0.9485714435577393

附錄：

預(yù)測的實現(xiàn)代碼就很簡單了插爹，這里就不列出了。

最后編輯于：2020.04.07 10:29:41

?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者

人面猴
序言：七十年代末请梢，一起剝皮案震驚了整個濱河市赠尾，隨后出現(xiàn)的幾起案子，更是在濱河造成了極大的恐慌毅弧，老刑警劉巖气嫁，帶你破解...
沈念sama閱讀 219,589評論 6贊 508
死咒
序言：濱河連續(xù)發(fā)生了三起死亡事件，死亡現(xiàn)場離奇詭異够坐，居然都是意外死亡寸宵，警方通過查閱死者的電腦和手機，發(fā)現(xiàn)死者居然都...
沈念sama閱讀 93,615評論 3贊 396
救了他兩次的神仙讓他今天三更去死
文/潘曉璐我一進店門元咙，熙熙樓的掌柜王于貴愁眉苦臉地迎上來梯影，“玉大人，你說我怎么就攤上這事庶香〖坠鳎” “怎么了？”我有些...
開封第一講書人閱讀 165,933評論 0贊 356
道士緝兇錄：失蹤的賣姜人
文/不壞的土叔我叫張陵赶掖，是天一觀的道長救军。經(jīng)常有香客問我，道長倘零，這世上最難降的妖魔是什么唱遭？我笑而不...
開封第一講書人閱讀 58,976評論 1贊 295
?港島之戀（遺憾婚禮）
正文為了忘掉前任，我火速辦了婚禮呈驶，結(jié)果婚禮上拷泽，老公的妹妹穿的比我還像新娘。我一直安慰自己袖瞻，他們只是感情好司致，可當(dāng)我...
茶點故事閱讀 67,999評論 6贊 393
惡毒庶女頂嫁案：這布局不是一般人想出來的
文/花漫我一把揭開白布。她就那樣靜靜地躺著聋迎，像睡著了一般脂矫。火紅的嫁衣襯著肌膚如雪。梳的紋絲不亂的頭發(fā)上霉晕，一...
開封第一講書人閱讀 51,775評論 1贊 307
城市分裂傳說
那天庭再，我揣著相機與錄音捞奕，去河邊找鬼。笑死拄轻，一個胖子當(dāng)著我的面吹牛颅围，可吹牛的內(nèi)容都是我干的。我是一名探鬼主播恨搓，決...
沈念sama閱讀 40,474評論 3贊 420
雙鴛鴦連環(huán)套：你想象不到人心有多黑
文/蒼蘭香墨我猛地睜開眼院促，長吁一口氣：“原來是場噩夢啊……” “哼！你這毒婦竟也來了斧抱？” 一聲冷哼從身側(cè)響起常拓，我...
開封第一講書人閱讀 39,359評論 0贊 276
萬榮殺人案實錄
序言：老撾萬榮一對情侶失蹤，失蹤者是張志新（化名）和其女友劉穎辉浦，沒想到半個月后墩邀，有當(dāng)?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體，經(jīng)...
沈念sama閱讀 45,854評論 1贊 317
?護林員之死
正文獨居荒郊野嶺守林人離奇死亡盏浙，尸身上長有42處帶血的膿包…… 初始之章·張勛以下內(nèi)容為張勛視角年9月15日...
茶點故事閱讀 38,007評論 3贊 338
?白月光啟示錄
正文我和宋清朗相戀三年眉睹，在試婚紗的時候發(fā)現(xiàn)自己被綠了。大學(xué)時的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片废膘。...
茶點故事閱讀 40,146評論 1贊 351
活死人
序言：一個原本活蹦亂跳的男人離奇死亡竹海，死狀恐怖，靈堂內(nèi)的尸體忽然破棺而出丐黄，到底是詐尸還是另有隱情斋配，我是刑警寧澤，帶...
沈念sama閱讀 35,826評論 5贊 346
?日本核電站爆炸內(nèi)幕
正文年R本政府宣布灌闺，位于F島的核電站艰争，受9級特大地震影響，放射性物質(zhì)發(fā)生泄漏桂对。R本人自食惡果不足惜甩卓，卻給世界環(huán)境...
茶點故事閱讀 41,484評論 3贊 331
男人毒藥：我在死后第九天來索命
文/蒙蒙一、第九天我趴在偏房一處隱蔽的房頂上張望蕉斜。院中可真熱鬧逾柿，春花似錦、人聲如沸宅此。這莊子的主人今日做“春日...
開封第一講書人閱讀 32,029評論 0贊 22
一樁弒父案，背后竟有這般陰謀
文/蒼蘭香墨我抬頭看了看天上的太陽父腕。三九已至弱匪，卻和暖如春，著一層夾襖步出監(jiān)牢的瞬間璧亮，已是汗流浹背萧诫。一陣腳步聲響...
開封第一講書人閱讀 33,153評論 1贊 272
情欲美人皮
我被黑心中介騙來泰國打工斥难，沒想到剛下飛機就差點兒被人妖公主榨干…… 1. 我叫王不留，地道東北人财搁。一個月前我還...
沈念sama閱讀 48,420評論 3贊 373
代替公主和親
正文我出身青樓蘸炸，卻偏偏與公主長得像躬络，于是被迫代替她去往敵國和親尖奔。傳聞我的和親對象是個殘疾皇子，可洞房花燭夜當(dāng)晚...
茶點故事閱讀 45,107評論 2贊 356

TORCH04-01TorchText之文本數(shù)據(jù)集處理

tortext 模塊結(jié)構(gòu)

torchtext.data結(jié)構(gòu)

TorchText使用例子

環(huán)境安裝

數(shù)據(jù)源

定義字段Field

Field類幫助文檔

Field類說明

構(gòu)建Feild字段的例子

構(gòu)建數(shù)據(jù)集

Dataset的幫助文檔

Dataset屬性說明

構(gòu)建數(shù)據(jù)集的字段

Example幫助文檔

構(gòu)建Example對象

構(gòu)建Example列表

構(gòu)建數(shù)據(jù)集

深入理解數(shù)據(jù)集

構(gòu)建批次數(shù)據(jù)

Iterator的幫助文檔

使用splits函數(shù)構(gòu)建Iterator對象

詞向量與構(gòu)建詞表

預(yù)訓(xùn)練的詞向量

使用詞向量構(gòu)建詞表

使用數(shù)據(jù)集

遍歷

取數(shù)據(jù)

文本分類中的TorchText應(yīng)用

數(shù)據(jù)集處理

函數(shù)封裝

加載訓(xùn)練集與測試集

模型

訓(xùn)練

訓(xùn)練的核心函數(shù)

準(zhǔn)備訓(xùn)練的條件

訓(xùn)練并驗證

推薦閱讀更多精彩內(nèi)容