Convolutional Image Captioning

GitHub code: https://github.com/aditya12agd5/convcap
Paper: Convolutional Image Captioning


In short, this is an end-to-end captioning network that extracts image features with VGG16 and generates the caption with an attention-equipped convolutional decoder (in place of the usual LSTM decoder). Let's start with the overall architecture diagram.

[Figure: overall network architecture]

The paper itself left me a bit lost, so I will walk through the code instead.

1. Feature extraction network: VGG16

The feature extraction module is simply a VGG16.
vggfeats.py

import torch
import torch.nn as nn
from torchvision import models

pretrained_model = models.vgg16(pretrained=True)

class Vgg16Feats(nn.Module):
  def __init__(self):
    super(Vgg16Feats, self).__init__()
    # All convolutional layers except the final max-pool
    self.features_nopool = nn.Sequential(*list(pretrained_model.features.children())[:-1])
    # The final max-pool layer
    self.features_pool = list(pretrained_model.features.children())[-1]
    # Fully connected layers up to fc7 (the final classification layer is dropped)
    self.classifier = nn.Sequential(*list(pretrained_model.classifier.children())[:-1])

  def forward(self, x):
    # x: [20, 3, 224, 224] -> conv features [20, 512, 14, 14]
    x = self.features_nopool(x)
    # x_pool: [20, 512, 7, 7], the spatial features later used by attention
    x_pool = self.features_pool(x)
    # x_feat: [20, 25088], flattened
    x_feat = x_pool.view(x_pool.size(0), -1)
    # y: [20, 4096], the fc7 features
    y = self.classifier(x_feat)
    return x_pool, y
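
A minimal smoke test of this module (my own sketch, not part of the repo; the batch size of 20 just matches the shapes in the comments above):

# Hypothetical usage sketch for Vgg16Feats
import torch

model = Vgg16Feats().eval()
imgs = torch.randn(20, 3, 224, 224)   # dummy batch of 20 RGB images
with torch.no_grad():
    x_pool, fc7 = model(imgs)
print(x_pool.shape)   # torch.Size([20, 512, 7, 7])  -> spatial features used by attention
print(fc7.shape)      # torch.Size([20, 4096])       -> fc7 features fed into convcap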

2. The convcap main network

Below is my (admittedly rough) drawing of the convcap main network.

[Figure: convcap main network]

Hand-drawn flow of the convcap main network:
[Figure: convcap flow sketch]

Hand-drawn flow of the attention module:
[Figure: attention flow sketch]

convcap.py

# -*- coding: utf-8 -*-
import sys

import math
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.autograd import Variable

#Layers adapted for captioning from https://arxiv.org/abs/1705.03122
def Conv1d(in_channels, out_channels, kernel_size, padding, dropout=0):
    m = nn.Conv1d(in_channels, out_channels, kernel_size, padding=padding)
    std = math.sqrt((4 * (1.0 - dropout)) / (kernel_size * in_channels))
    m.weight.data.normal_(mean=0, std=std)
    m.bias.data.zero_()
    return nn.utils.weight_norm(m)

def Embedding(num_embeddings, embedding_dim, padding_idx):
    m = nn.Embedding(num_embeddings, embedding_dim, padding_idx=padding_idx)
    m.weight.data.normal_(0, 0.1)
    return m

def Linear(in_features, out_features, dropout=0.):
    m = nn.Linear(in_features, out_features)
    m.weight.data.normal_(mean=0, std=math.sqrt((1 - dropout) / in_features))
    m.bias.data.zero_()
    return nn.utils.weight_norm(m)
# Attention layer
class AttentionLayer(nn.Module):
  def __init__(self, conv_channels, embed_dim):
    super(AttentionLayer, self).__init__()
    self.in_projection = Linear(conv_channels, embed_dim)
    self.out_projection = Linear(embed_dim, conv_channels)
    self.bmm = torch.bmm

  def forward(self, x, wordemb, imgsfeats):
    # x: [B, T, conv_channels] decoder states, wordemb: [B, T, embed_dim],
    # imgsfeats: [B, C, H, W] spatial VGG16 features
    residual = x
    # Project the decoder states into the embedding space and add the word embeddings
    x = (self.in_projection(x) + wordemb) * math.sqrt(0.5)

    b, c, f_h, f_w = imgsfeats.size()
    y = imgsfeats.view(b, c, f_h*f_w)
    # Batched matrix multiplication: attention logits of shape [B, T, H*W]
    x = self.bmm(x, y)

    sz = x.size()
    # Softmax over the spatial locations (dim given explicitly to avoid the deprecation warning)
    x = F.softmax(x.view(sz[0] * sz[1], sz[2]), dim=1)
    x = x.view(sz)
    attn_scores = x
    # Swap the channel and spatial axes: y becomes [B, H*W, C]
    y = y.permute(0, 2, 1)
    # Attention-weighted sum of the image features: [B, T, C]
    x = self.bmm(x, y)

    # Scale by sqrt(H*W) to compensate for the averaging done by the attention weights
    s = y.size(1)
    x = x * (s * math.sqrt(1.0 / s))
    # Project back to conv_channels and add the residual connection
    x = (self.out_projection(x) + residual) * math.sqrt(0.5)

    return x, attn_scores
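
# Hypothetical shape check (my own note, not in the original file): with
# conv_channels=512 and embed_dim=512, passing x and wordemb of shape
# [100, 15, 512] together with imgsfeats of shape [100, 512, 7, 7] returns
# x of shape [100, 15, 512] and attn_scores of shape [100, 15, 49], i.e.
# one attention weight per 7x7 spatial location for each of the 15 tokens.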

class convcap(nn.Module):
  
  def __init__(self, num_wordclass, num_layers=1, is_attention=True, nfeats=512, dropout=.1):
    super(convcap, self).__init__()
    # 4096: the model consumes the fc7 features from VGG16's fully connected layers
    self.nimgfeats = 4096
    self.is_attention = is_attention
    # Feature dimension used for each word
    self.nfeats = nfeats
    # Dropout rate of 10%
    self.dropout = dropout 

    # Word embedding table
    self.emb_0 = Embedding(num_wordclass, nfeats, padding_idx=0)
    # Fully connected layer from word features to word features
    self.emb_1 = Linear(nfeats, nfeats, dropout=dropout)
    # Fully connected layer from the 4096-d image feature to the word feature size
    self.imgproj = Linear(self.nimgfeats, self.nfeats, dropout=dropout)
    # Fully connected layer from 2*nfeats (word + image) down to nfeats, used for the first residual
    self.resproj = Linear(nfeats*2, self.nfeats, dropout=dropout)

    n_in = 2*self.nfeats 
    n_out = self.nfeats
    self.n_layers = num_layers
    # Module lists holding the convolution and attention layers
    self.convs = nn.ModuleList()
    self.attention = nn.ModuleList()
    # Kernel size
    self.kernel_size = 5
    # Padding size (causal padding, trimmed again after the convolution)
    self.pad = self.kernel_size - 1
    for i in range(self.n_layers):
      self.convs.append(Conv1d(n_in, 2*n_out, self.kernel_size, self.pad, dropout))
      if(self.is_attention):
        self.attention.append(AttentionLayer(n_out, nfeats))
      n_in = n_out
    # The last two layers classify each position into a word of the vocabulary
    self.classifier_0 = Linear(self.nfeats, (nfeats // 2))
    self.classifier_1 = Linear((nfeats // 2), num_wordclass, dropout=dropout)

  def forward(self, imgsfeats, imgsfc7, wordclass):

    attn_buffer = None
    # Word embeddings of the caption tokens
    wordemb = self.emb_0(wordclass)
    # One fully connected layer on top of the embeddings
    wordemb = self.emb_1(wordemb)
    # Transpose so dim 1 holds the word features and dim 2 the 15 token positions, x: [100, 512, 15]
    x = wordemb.transpose(2, 1)   
    batchsize, wordembdim, maxtokens = x.size()
    # Project the image feature from 4096 to 512 and replicate it across the 15 token positions, y: [100, 512, 15]
    y = F.relu(self.imgproj(imgsfc7))
    y = y.unsqueeze(2).expand(batchsize, self.nfeats, maxtokens)
    # Concatenate the word features with the image features, x: [100, 1024, 15]
    x = torch.cat([x, y], 1)

    for i, conv in enumerate(self.convs):
      
      if(i == 0):
        # Swap dims 1 and 2, x: [100, 15, 1024]
        x = x.transpose(2, 1)
        # Project the 1024-d inputs down to 512 for the residual: [100, 15, 512]
        residual = self.resproj(x)
        # Back to [100, 512, 15]
        residual = residual.transpose(2, 1)
        x = x.transpose(2, 1)
      else:
        residual = x
      # Dropout
      x = F.dropout(x, p=self.dropout, training=self.training)
      # 1-D convolution over the token dimension
      x = conv(x)
      # Trim the right-hand padding so the convolution stays causal
      x = x[:,:,:-self.pad]
      # Gated linear unit halves the channel dimension back to nfeats
      x = F.glu(x, dim=1)

      if(self.is_attention):
        attn = self.attention[i]
        x = x.transpose(2, 1)
        # x: decoder states (word + image), wordemb: word embeddings, imgsfeats: spatial features before the fc layers
        x, attn_buffer = attn(x, wordemb, imgsfeats)
        x = x.transpose(2, 1)
    
      # Residual connection
      x = (x+residual)*math.sqrt(.5)

    x = x.transpose(2, 1)
  
    # Two-layer classifier producing a word distribution per position
    x = self.classifier_0(x)
    x = F.dropout(x, p=self.dropout, training=self.training)
    x = self.classifier_1(x)

    x = x.transpose(2, 1)

    return x, attn_buffer
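
A minimal end-to-end smoke test tying the two modules together (my own sketch with hypothetical numbers: a vocabulary of 9221 words, 15 tokens per caption, 3 convolutional layers; only the tensor shapes matter here):

# Hypothetical smoke test for convcap (not part of the repo)
import torch

vocab_size, max_tokens = 9221, 15
model_convcap = convcap(vocab_size, num_layers=3, is_attention=True)

imgsfeats = torch.randn(100, 512, 7, 7)                      # spatial VGG16 features
imgsfc7 = torch.randn(100, 4096)                             # fc7 features
wordclass = torch.randint(0, vocab_size, (100, max_tokens))  # token indices

wordact, attn = model_convcap(imgsfeats, imgsfc7, wordclass)
print(wordact.shape)  # torch.Size([100, 9221, 15]) -> a word distribution per position
print(attn.shape)     # torch.Size([100, 15, 49])   -> attention over the 7x7 locations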

3. Training

train.py
Only part of the code is shown.

    for batch_idx, (imgs, captions, wordclass, mask, _) in \
      tqdm(enumerate(train_data_loader), total=nbatches):

      imgs = imgs.view(batchsize, 3, 224, 224)
      wordclass = wordclass.view(batchsize_cap, max_tokens)
      mask = mask.view(batchsize_cap, max_tokens)

      imgs_v = Variable(imgs).cuda()
      wordclass_v = Variable(wordclass).cuda()

      optimizer.zero_grad()
      if(img_optimizer):
        img_optimizer.zero_grad() 
      # Extract image features
      imgsfeats, imgsfc7 = model_imgcnn(imgs_v)
      imgsfeats, imgsfc7 = repeat_img_per_cap(imgsfeats, imgsfc7, ncap_per_img)
      _, _, feat_h, feat_w = imgsfeats.size()
      # Run the convcap network to obtain the per-position word activations and the attention maps
      if(args.attention == True):
        wordact, attn = model_convcap(imgsfeats, imgsfc7, wordclass_v)
        attn = attn.view(batchsize_cap, max_tokens, feat_h, feat_w)
      else:
        wordact, _ = model_convcap(imgsfeats, imgsfc7, wordclass_v)
      # Align predictions with next-word targets: drop the last prediction position and the start token of the targets
      wordact = wordact[:,:,:-1]
      wordclass_v = wordclass_v[:,1:]
      mask = mask[:,1:].contiguous()

      wordact_t = wordact.permute(0, 2, 1).contiguous().view(\
        batchsize_cap*(max_tokens-1), -1)
      wordclass_t = wordclass_v.contiguous().view(\
        batchsize_cap*(max_tokens-1), 1)
      # Indices of the valid (non-padded) positions in the captions
      maskids = torch.nonzero(mask.view(-1)).numpy().reshape(-1)

      if(args.attention == True):
        # Cross-entropy loss plus an attention regularization term
        loss = F.cross_entropy(wordact_t[maskids, ...], \
          wordclass_t[maskids, ...].contiguous().view(maskids.shape[0])) \
          + (torch.sum(torch.pow(1. - torch.sum(attn, 1), 2)))\
          /(batchsize_cap*feat_h*feat_w)
      else:
        loss = F.cross_entropy(wordact_t[maskids, ...], \
          wordclass_t[maskids, ...].contiguous().view(maskids.shape[0]))
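
The second term in the attention branch is a regularizer: summing the attention weights over the token dimension and penalizing the squared deviation from 1 encourages every spatial location to receive roughly one unit of attention across the whole caption (similar to the doubly stochastic attention regularization in Show, Attend and Tell). A small sketch of just that term with hypothetical shapes:

# Hypothetical illustration of the attention regularizer (not repo code)
import torch

batchsize_cap, max_tokens, feat_h, feat_w = 100, 15, 7, 7
attn = torch.rand(batchsize_cap, max_tokens, feat_h, feat_w)

# Total attention each spatial location received across all tokens: [100, 7, 7]
total_per_location = torch.sum(attn, 1)
attn_reg = torch.sum((1. - total_per_location) ** 2) / (batchsize_cap * feat_h * feat_w)
print(attn_reg)  # scalar penalty added to the cross-entropy loss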