Implementing the Transformer Code by Following the Tutorial

As Zhang Junlin's article on BERT points out, to understand BERT you first have to understand the Transformer; moreover, the Transformer has been replacing RNNs and CNNs as the SOTA feature extractor in NLP. There are two main reasons it earns that title:

  • Better results: Transformer > CNN > RNN
  • Faster: Transformer \(\approx\) CNN > RNN

This post is essentially a translation of The Annotated Transformer (-_-).

### Required packages

pytorch, torchtext (yes, this is a PyTorch-based implementation). PyTorch is quite convenient to work with.

Without further ado, let's first look at the overall structure of the model. No matter how many times you read blog posts and the paper, nothing beats implementing it yourself once, so here I explain how to implement the paper Attention is All You Need from a coding perspective. (Note that this post approaches the Transformer from the implementation side; for an explanation of the underlying ideas, see my other blog post.)

Looking at the overall model structure proposed in Attention is All You Need, we can of course easily answer that it has two parts: an Encoder and a Decoder. Frankly, though, that is of limited help when writing code. If we divide the model by function instead, we get four functional modules:

  • Positional Encoding module: its input is a vector and its output is a vector as well, but the output already carries positional information. We will call the class that implements this functionality PositionalEncoding.
  • Multi-Head Attention + Add & Norm module: note that the Feed Forward + Add & Norm module that appears next also contains an Add & Norm operation, so following object-oriented design we should encapsulate it; we name this Add & Norm module SublayerConnection. The Multi-Head Attention itself is implemented by the MultiHeadAttention class.

  • Feed Forward + Add & Norm module: this can be implemented with a PositionwiseFeedForward class plus a SublayerConnection class.
  • Generator module: the Generator is simply a Linear layer followed by a Softmax.

Having grouped the model by function, let's now look at it structurally. The model can be divided into the following four parts, which we will call components:

  • Positional Encoding component: nothing special to discuss, just implement it.
  • Encoder component: the Encoder can be viewed as a stack of N identical EncoderLayers (N = 6 in the paper).
  • Decoder component: the Decoder can be viewed as a stack of N identical DecoderLayers (N = 6 in the paper).
  • Generator component: again, just implement it.

Now comes the exciting implementation part. We start from the functional modules and implement them one by one.

### Implementing Positional Encoding

Think about it: if the model is built entirely on attention (self-attention + encoder-decoder attention), what problem remains?      Time's up, did you get it? There is no notion of sequence order. Put differently, "This book is worthy reading" and "Book reading worthy is this" would get exactly the same Encoder representation, which is clearly wrong. To inject positional information the way RNNs and CNNs do, we must provide a way to add position information. There are in fact two options, learned and fixed positional encodings; the paper goes with the fixed sinusoidal version: \[ \begin{cases} PE_{(pos,2i)} = \sin(pos / 10000^{2i/d_{\text{model}}}) \\ PE_{(pos,2i+1)} = \cos(pos / 10000^{2i/d_{\text{model}}}) \end{cases} \] Remember the trigonometric functions from middle school? For \(\sin(k \cdot x)\), if we take \(k = 1 / 10000^{2i/d_{\text{model}}}\), then the positions of a sequence correspond to different points along one sinusoid; take \(max\_len = 5000\) as an example and consider the two extreme cases of k:

  • When \(i = 0\), \(k = 1\), so the wavelength is \(2\pi\); as pos runs over \([1, 2, \cdots, 5000]\), the curve clearly completes many cycles.
  • When \(2i\) approaches \(d_{\text{model}}\), the wavelength is approximately \(10000 \cdot 2\pi\); as pos runs over \([1, 2, \cdots, 5000]\), not even one full cycle is completed.

Note how the encoding is applied: it is added to the original input embedding. In other words, the positional encoding also has dimension \(d_{\text{model}}\), so the two can be added element-wise. Note that this is a plain element-wise addition (pointwise add), not a concatenation:

def forward(self, x):
    # Add the (fixed, non-trainable) positional encoding to the input embeddings
    x = x + Variable(self.pe[:, :x.size(1)], requires_grad=False)
    return self.dropout(x)

size(x): (n_batch, max_len, d_model); size(pe): (1, max_len, d_model). One distinction is worth spelling out: the max_len in size(x) is the actual maximum sentence length of the batch. For example, with n_batch = 128, max_len = 50, d_model = 512, the batch contains 128 sentences (samples), each sentence is at most 50 tokens long, and every word is represented by a 512-dimensional vector. The max_len in size(pe), on the other hand, is the maximum length the positional encoding table is precomputed for, which leaves plenty of headroom; that is why pe is not added to x directly but sliced first, i.e. pe[:, :x.size(1)].
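For completeness, here is a minimal sketch of how self.pe can be precomputed in the constructor from the sin/cos formulas above. The constructor signature and the max_len=5000 default follow The Annotated Transformer and are not shown elsewhere in this post, so treat the details as illustrative.

import math
import torch
import torch.nn as nn
from torch.autograd import Variable

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        # Precompute the positional encodings once, for up to max_len positions
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0., max_len).unsqueeze(1)            # (max_len, 1)
        div_term = torch.exp(torch.arange(0., d_model, 2) *
                             -(math.log(10000.0) / d_model))         # 1 / 10000^{2i/d_model}
        pe[:, 0::2] = torch.sin(position * div_term)                 # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)                 # odd dimensions
        pe = pe.unsqueeze(0)                                         # (1, max_len, d_model)
        # register_buffer: part of the module's state, but not a trainable parameter
        self.register_buffer('pe', pe)

    def forward(self, x):
        # x: (n_batch, max_len, d_model); slice pe to the actual sequence length
        x = x + Variable(self.pe[:, :x.size(1)], requires_grad=False)
        return self.dropout(x)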

### Implementing SublayerConnection

Let's look at how Section 3.1 of the paper describes the Encoder: > We employ a residual connection around each of the two sub-layers, followed by layer normalization. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)).

To be honest, I find LayerNorm(x + Sublayer(x)) too coarse-grained: it neither makes the actual computation clear nor helps much when writing code. Let's think about what a residual connection means. Suppose the input to some network is x and the network itself is denoted sublayer; then the residual connection computes y = x + sublayer(x).

So it looks like all we have to do is pass the sublayer in as an argument when defining SublayerConnection:

class SublayerConnection(nn.Module):
    # TODO
    def forward(self, x, sublayer):
        # TODO
        pass

But notice that we call sublayer(x) here. Spoiler: the sublayer will be either self-attention or the position-wise feed forward network, both of which are parameter-heavy DNNs. For such networks, to keep training stable (normalization) and to prevent overfitting (regularization), we apply the following tricks:

  • Layer Normalization: to mitigate the internal covariate shift problem we apply Layer Normalization (see the Normalization post if you are interested); we normalize the input, i.e. Layer Normalization is applied to the input x.
  • Regularization: to prevent overfitting we use dropout, applied to the sublayer output, i.e. dropout(sublayer(norm(x))).

All right, time to implement the SublayerConnection class (rubbing hands in excitement):

"""
input: x
SublayerConnection performs the following operations, in order:
1. Layer-normalize the input
2. Apply the sublayer function
3. Dropout: see Section 5.4 Regularization, Residual Dropout:
   <We apply dropout to the output of each sub-layer, before it is added to the
   sub-layer input and normalized>

   The addition between the sub-layer output and the sub-layer input is known
   as the residual connection.
4. Residual connection
"""
class SublayerConnection(nn.Module):
    def __init__(self, size, dropout):
        super(SublayerConnection, self).__init__()
        self.norm = LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    # sublayer is a function defined by self attention or feed forward
    def forward(self, x, sublayer):
        # Normalization
        norm_x = self.norm(x)
        # Sublayer function
        sublayer_x = sublayer(norm_x)
        # Dropout is applied to the sublayer output
        dropout_x = self.dropout(sublayer_x)
        # Residual connection
        return x + dropout_x

Note that we have introduced another new module, Layer Normalization. It is simple to implement:

"""
LayerNorm implements Layer Normalization (LN) for each of the two sub-layers.
For more information about Layer Normalization and normalization in general, please see my blog.
"""
class LayerNorm(nn.Module):
    def __init__(self, features, eps=1e-6):
        super(LayerNorm, self).__init__()
        # a_2 (scale) and b_2 (shift) are trainable parameters applied after normalization
        self.a_2 = nn.Parameter(torch.ones(features))
        self.b_2 = nn.Parameter(torch.zeros(features))
        self.eps = eps

    def forward(self, x):
        # Normalize over the last (feature) dimension
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.a_2 * (x - mean) / (std + self.eps) + self.b_2
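As a quick sanity check (the tensor shapes here are arbitrary, just for illustration): each feature vector should come out with roughly zero mean and unit standard deviation before a_2 and b_2 rescale it.

import torch

ln = LayerNorm(512)
x = torch.randn(2, 10, 512) * 3.0 + 5.0      # (batch, seq_len, d_model), shifted and scaled
y = ln(x)
# With a_2 = 1 and b_2 = 0 (their initial values), the output is standardized per position
print(y.mean(-1)[0, 0].item())               # close to 0
print(y.std(-1)[0, 0].item())                # close to 1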

### Implementing the Multi-Head Attention module

Every time I see Google's description of attention I can't help wanting to type it out; it is concise and goes straight to the point: > An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

So let's first implement the attention function. Note that Google calls this mechanism Scaled Dot-Product Attention: \[ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}(\frac{QK^T}{\sqrt{d_k}})V \] Why divide by \(\sqrt{d_k}\)? The paper's explanation is: > We suspect that for large values of \(d_k\), the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients. To counteract this effect, we scale the dot products by \(\frac{1}{\sqrt{d_k}}\).

In other words, the paper argues that when the dimensionality is large, the elements of \(QK^T\) become large, so after the softmax we easily land in saturated regions where the gradients are tiny, which slows training down. Now that the idea is clear, the implementation is straightforward. Still, once you have spent enough time training models, it pays to look closely at this expression: what does the simple \(QK^T\) actually mean? Each of its entries is the dot product of one query vector with one key vector, i.e. a similarity score between a query position and a key position.

import math
import torch
import torch.nn.functional as F

def attention(query, key, value, mask=None, dropout=None):
    "Scaled Dot-Product Attention"
    d_k = query.size(-1)
    # scores: similarity of every query position with every key position
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Positions where mask == 0 are blocked from being attended to
        scores = scores.masked_fill(mask == 0, -1e9)

    p_attn = F.softmax(scores, dim=-1)
    if dropout is not None:
        p_attn = dropout(p_attn)

    return torch.matmul(p_attn, value), p_attn
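A quick shape check (the numbers are arbitrary, just for illustration): with 4-dimensional inputs as used by the multi-head attention below, the attention weights have one row per query position and one column per key position.

import torch

batch, h, len_q, len_k, d_k = 2, 8, 5, 7, 64
q = torch.randn(batch, h, len_q, d_k)
k = torch.randn(batch, h, len_k, d_k)
v = torch.randn(batch, h, len_k, d_k)

out, p_attn = attention(q, k, v)
print(out.shape)      # torch.Size([2, 8, 5, 64]) : one output vector per query position
print(p_attn.shape)   # torch.Size([2, 8, 5, 7])  : attention weights over the key positions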

Next comes the centerpiece of Attention is All You Need: Multi-Head Attention. What does it buy us? > Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.

In short, we split one big attention (of dimension \(d_{\text{model}}\)) into several lower-dimensional attentions. First the formula: \[ \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head_1}, ..., \mathrm{head_h})W^O \\ \text{where}~\mathrm{head_i} = \mathrm{Attention}(QW^Q_i, KW^K_i, VW^V_i) \] where the projections are parameter matrices \(W^Q_i \in \mathbb{R}^{d_{\text{model}} \times d_k}\), \(W^K_i \in \mathbb{R}^{d_{\text{model}} \times d_k}\), \(W^V_i \in \mathbb{R}^{d_{\text{model}} \times d_v}\) and \(W^O \in \mathbb{R}^{hd_v \times d_{\text{model}}}\).

The implementation is also quite direct:

"""
h is the number of parallel attention heads. In the paper, h = 8.
"""
class MultiHeadAttention(nn.Module):
    def __init__(self, h, d_model, dropout=0.1):
        super(MultiHeadAttention, self).__init__()
        assert d_model % h == 0

        # self.d_k is the reduced dimension of each attention head
        self.d_k = d_model // h
        self.h = h

        # self.linears is a list of 4 projection layers:
        # self.linears[0]: Concat(W^Q_i), where i \in [1,...,h]
        # self.linears[1]: Concat(W^K_i), where i \in [1,...,h]
        # self.linears[2]: Concat(W^V_i), where i \in [1,...,h]
        # self.linears[3]: W^O, the final output projection
        self.linears = clones(nn.Linear(d_model, d_model), 4)
        self.attn = None
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, query, key, value, mask=None):
        # query.size() = key.size() = value.size() = (batch_size, max_len, d_model)
        if mask is not None:
            # The same mask is applied to every head
            mask = mask.unsqueeze(1)
        batch_size = query.size(0)

        """
        Do all the linear projections; after this operation
        query.size() = key.size() = value.size() = (batch_size, self.h, max_len, self.d_k)
        """
        query, key, value = \
            [linear(x).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)
             for linear, x in zip(self.linears, (query, key, value))]
        """
        x.size():         (batch_size, h, max_len, d_v)
        self.attn.size(): (batch_size, h, max_len, max_len), the attention weights
        """
        x, self.attn = attention(query, key, value, mask=mask, dropout=self.dropout)
        """
        x.transpose(1, 2).size(): (batch_size, max_len, h, d_v)
        The transpose (plus contiguous) is necessary before the view that
        concatenates the heads; x.size() then becomes (batch_size, max_len, h*d_v)
        """
        x = x.transpose(1, 2).contiguous().view(batch_size, -1, self.h * self.d_k)

        # self.linears[-1] corresponds to W^O \in R^{h*d_v x d_model}
        return self.linears[-1](x)
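The code above (and the Encoder/Decoder below) relies on a clones helper that is never defined in this post. A minimal sketch, following The Annotated Transformer, would be:

import copy
import torch.nn as nn

def clones(module, N):
    "Produce N identical, independently parameterized copies of a module."
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])

The deepcopy matters here: the N copies must not share parameters, even though they start from the same initialization.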

### Implementing the Position-wise Feed-Forward component

This one is simple overall; just look at the formula: \[ \mathrm{FFN}(x)=\max(0, xW_1 + b_1) W_2 + b_2 \]

class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        super(PositionwiseFeedForward, self).__init__()
        # d_ff is the inner-layer dimensionality (2048 in the paper)
        self.w_1 = nn.Linear(d_model, d_ff)
        self.w_2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # FFN(x) = max(0, x W_1 + b_1) W_2 + b_2, with dropout after the ReLU
        w1_x = self.w_1(x)
        relu_x = F.relu(w1_x)
        dropout_x = self.dropout(relu_x)
        return self.w_2(dropout_x)
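A quick shape check (arbitrary sizes, with d_ff = 2048 as in the paper): the feed-forward network is applied to each position independently and preserves the model dimension.

import torch

ffn = PositionwiseFeedForward(d_model=512, d_ff=2048)
x = torch.randn(2, 10, 512)       # (batch, seq_len, d_model)
print(ffn(x).shape)               # torch.Size([2, 10, 512])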

### Implementing the Generator component

class Generator(nn.Module):
    # Generator = Linear layer (projection) + Softmax layer
    # (outputs a probability distribution over the words in the vocabulary)

    def __init__(self, dimen_model, dimen_vocab):
        # dimen_model: the dimension of the decoder output
        # dimen_vocab: the size of the vocabulary
        super(Generator, self).__init__()
        self.projection_layer = nn.Linear(dimen_model, dimen_vocab)

    def forward(self, x):
        # input.size():  (batch_size, max_len, d_model)
        # output.size(): (batch_size, max_len, d_vocab), log-probabilities
        return F.log_softmax(self.projection_layer(x), dim=-1)

### Structural components

### Position-wise Feed-Forward Network

See the description in the functional modules section above.

### Encoder component

The Encoder consists of N = 6 identical EncoderLayers. Without further ado, let's implement these two classes:

class Encoder(nn.Module):
    def __init__(self, layer, N):
        super(Encoder, self).__init__()
        # N identical EncoderLayers plus a final LayerNorm
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)

    def forward(self, x, mask):
        for layer in self.layers:
            x = layer(x, mask)
        memory = self.norm(x)
        return memory

Note: the memory we return receives one more Layer Normalization after passing through all the EncoderLayers. This is because memory will serve as the Key and Value of the source-target (encoder-decoder) attention layer, so the final Layer Normalization is necessary.

"""
EncoderLayer is a single layer of the Encoder;
each EncoderLayer is composed of two sub-layers:
1. self-attention
2. feed forward
"""
class EncoderLayer(nn.Module):
    def __init__(self, size, self_attn, feed_forward, dropout):
        super(EncoderLayer, self).__init__()
        # self.self_attn is the Multi-Head Attention layer
        self.self_attn = self_attn
        self.feed_forward = feed_forward
        # One SublayerConnection (norm + sublayer + dropout + residual) per sub-layer
        self.sublayerconnections = clones(SublayerConnection(size, dropout), 2)
        self.size = size

    def forward(self, x, mask):
        # self attention
        self_attention_x = self.sublayerconnections[0](x, lambda x: self.self_attn(x, x, x, mask))
        # feed forward
        feed_forward_x = self.sublayerconnections[1](self_attention_x, self.feed_forward)
        return feed_forward_x
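To see how these pieces fit together, here is a small usage sketch. The hyperparameters are simply the paper's defaults, and the mask is a dummy all-ones mask, so treat the numbers as illustrative.

import torch

d_model, h, d_ff, N, dropout = 512, 8, 2048, 6, 0.1
attn = MultiHeadAttention(h, d_model, dropout)
ff = PositionwiseFeedForward(d_model, d_ff, dropout)
encoder = Encoder(EncoderLayer(d_model, attn, ff, dropout), N)

x = torch.randn(2, 10, d_model)            # already embedded + positionally encoded input
src_mask = torch.ones(2, 1, 10)            # dummy mask: every position may attend everywhere
memory = encoder(x, src_mask)
print(memory.shape)                        # torch.Size([2, 10, 512])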

### Decoder component

No complicated logic here, so let's implement it directly. (The Decoder stacks N copies of a DecoderLayer; a sketch of that layer follows the code below.)

class Decoder(nn.Module):
    def __init__(self, layer, N):
        super(Decoder, self).__init__()
        # N identical DecoderLayers plus a final LayerNorm
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)

    def forward(self, x, memory, src_mask, tgt_mask):
        for layer in self.layers:
            x = layer(x, memory, src_mask, tgt_mask)
        return self.norm(x)
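The DecoderLayer itself is not shown in this post. A minimal sketch, following The Annotated Transformer and reusing SublayerConnection exactly as in EncoderLayer, might look like this; the three sub-layers are masked self-attention, source-target attention over the encoder memory, and the feed-forward network.

class DecoderLayer(nn.Module):
    def __init__(self, size, self_attn, src_attn, feed_forward, dropout):
        super(DecoderLayer, self).__init__()
        self.size = size
        # Masked self-attention over the (shifted) target sequence
        self.self_attn = self_attn
        # Source-target attention: queries come from the decoder, keys/values from memory
        self.src_attn = src_attn
        self.feed_forward = feed_forward
        self.sublayerconnections = clones(SublayerConnection(size, dropout), 3)

    def forward(self, x, memory, src_mask, tgt_mask):
        m = memory
        x = self.sublayerconnections[0](x, lambda x: self.self_attn(x, x, x, tgt_mask))
        x = self.sublayerconnections[1](x, lambda x: self.src_attn(x, m, m, src_mask))
        return self.sublayerconnections[2](x, self.feed_forward)

Here tgt_mask combines the padding mask with the subsequent-position mask, so that position i cannot attend to positions after i.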

As with the Encoder, a final normalization is still needed here, because the Decoder's output is fed into the Generator.

### The full network

All that remains is to wire the components above together; a sketch is given at the end of this post.

### Regularization tricks

  • Residual Dropout
  • Label Smoothing: inject noise into the target labels y to avoid overfitting
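Returning to the full network: here is a sketch of how the components above might be assembled, following The Annotated Transformer. The EncoderDecoder wrapper, the Embeddings helper, and the make_model factory below are not defined elsewhere in this post, so treat them as illustrative assumptions rather than this post's own code.

import copy
import math
import torch.nn as nn

class Embeddings(nn.Module):
    "Token embeddings scaled by sqrt(d_model), as in Section 3.4 of the paper."
    def __init__(self, d_model, vocab):
        super(Embeddings, self).__init__()
        self.lut = nn.Embedding(vocab, d_model)
        self.d_model = d_model

    def forward(self, x):
        return self.lut(x) * math.sqrt(self.d_model)

class EncoderDecoder(nn.Module):
    "Glue the components together: embed, encode, decode."
    def __init__(self, encoder, decoder, src_embed, tgt_embed, generator):
        super(EncoderDecoder, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.src_embed = src_embed
        self.tgt_embed = tgt_embed
        self.generator = generator

    def encode(self, src, src_mask):
        return self.encoder(self.src_embed(src), src_mask)

    def decode(self, memory, src_mask, tgt, tgt_mask):
        return self.decoder(self.tgt_embed(tgt), memory, src_mask, tgt_mask)

    def forward(self, src, tgt, src_mask, tgt_mask):
        return self.decode(self.encode(src, src_mask), src_mask, tgt, tgt_mask)

def make_model(src_vocab, tgt_vocab, N=6, d_model=512, d_ff=2048, h=8, dropout=0.1):
    "Build a full Transformer from the hyperparameters used in the paper."
    c = copy.deepcopy
    attn = MultiHeadAttention(h, d_model, dropout)
    ff = PositionwiseFeedForward(d_model, d_ff, dropout)
    position = PositionalEncoding(d_model, dropout)
    model = EncoderDecoder(
        Encoder(EncoderLayer(d_model, c(attn), c(ff), dropout), N),
        Decoder(DecoderLayer(d_model, c(attn), c(attn), c(ff), dropout), N),
        nn.Sequential(Embeddings(d_model, src_vocab), c(position)),
        nn.Sequential(Embeddings(d_model, tgt_vocab), c(position)),
        Generator(d_model, tgt_vocab))
    # Xavier/Glorot initialization for all weight matrices
    for p in model.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)
    return model

Note that EncoderDecoder.forward returns the decoder output; as in The Annotated Transformer, the Generator is applied separately during training and decoding.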