As Zhang Junlin's article on BERT points out, you have to understand the Transformer before you can understand BERT, and the Transformer has been replacing RNNs and CNNs as the SOTA feature extractor in NLP. There are two main reasons for calling the Transformer the SOTA feature extractor:

- Quality: Transformer > CNN > RNN
- Speed: Transformer \(\approx\) CNN > RNN

This post is essentially a translated walk-through of The Annotated Transformer (-_-).
### Required packages

PyTorch and torchtext (yes, this is a PyTorch-based implementation). PyTorch is quite pleasant to work with.
Without further ado, let's first look at the overall structure of the model. Reading blog posts and the paper over and over is no substitute for implementing it yourself, so here I explain how to implement Attention is All You Need from the perspective of writing the code. (Note that this post focuses on implementation; if you want to understand the Transformer from first principles, see my other blog post.)
First, look at the overall model structure proposed in Attention is All You Need. We can of course easily answer that it has two parts, an Encoder and a Decoder, but frankly that is of limited help for implementation. If instead we divide the model by function, we get four kinds of functional modules:

- Positional Encoding module: its input is a vector and its output is also a vector, but the output vector already carries the position information. We will call the class that implements this PositionalEncoding.
- Multi-Head Attention + Add & Norm module: the upcoming Feed Forward + Add & Norm module also contains an Add & Norm operation, so in the spirit of object-oriented programming we should encapsulate Add & Norm; we name this module SublayerConnection. Multi-Head Attention itself will be implemented by the MultiHeadAttention class.
- Feed Forward + Add & Norm module: this can be implemented with a PositionwiseFeedForward class plus the same SublayerConnection class.
- Generator module: the Generator is simply a Linear layer followed by a Softmax.
Having classified the model by function, let's now look at it structurally. It can be split into the following four parts, which we will call components:

- Position Encoding component: nothing special to discuss, just implement it.
- Encoder component: the Encoder can be seen as a stack of several (N = 6 in the paper) identical EncoderLayers.
- Decoder component: the Decoder can likewise be seen as a stack of several (N = 6 in the paper) identical DecoderLayers.
- Generator component: again, nothing special, just implement it.
Now for the exciting part: the implementation. We start from the functional modules and implement them one by one.

### Implementing Positional Encoding

Think for a moment: what problem remains with a model built entirely on attention (self-attention plus encoder-decoder attention)? Time's up, did you get it? There is no notion of order in the sequence. Put another way, "This book is worth reading" and "Book reading worth is this" would get exactly the same Encoder representation, which is clearly wrong. To inject position information the way RNNs and CNNs do implicitly, we need a way to add it explicitly. In fact, there are two cases, one for the even and one for the odd dimensions:
\[
\begin{cases}
PE_{(pos,2i)} = \sin(pos / 10000^{2i/d_{\text{model}}}) \\\\
PE_{(pos,2i+1)} = \cos(pos / 10000^{2i/d_{\text{model}}})
\end{cases}
\]
Remember the trigonometric functions from school, \(\sin(k \cdot x)\)? If we take \(k = 1 / 10000^{2i/d_{\text{model}}}\), then the positions of a sequence, from start to end, correspond to different points on one sine curve (with \(max\_len = 5000\) here). For \(k\), consider the two extreme cases:

- When \(i = 0\), \(k = 1\), so the wavelength is \(2\pi\); for pos in [1, 2, \(\cdots\), 5000] the curve clearly goes through many cycles.
- When \(2i\) approaches \(d_{\text{model}}\), the wavelength is roughly \(10000 \cdot 2\pi\); for pos in [1, 2, \(\cdots\), 5000] we do not even complete a single cycle.

Note how it is applied: the positional encoding is added to the original input embedding. That is, the positional encoding is also \(d_{\text{model}}\)-dimensional, so the two can be added point-wise. Note that this really is a plain element-wise addition (pointwise add), not a concatenation.

```python
def forward(self, x):
    x = x + Variable(self.pe[:, :x.size(1)], requires_grad=False)
    return self.dropout(x)
```
size(x): (n_batch, max_len, d_model); size(pe): (max_len, d_model). A small distinction is needed here: the max_len in size(x) is the max_len defined by the actual text, e.g. n_batch = 128, max_len = 50, d_model = 512, i.e. this batch contains 128 sentences (samples), each sentence (sample) is at most 50 tokens long, and every word in every sentence is represented by a 512-dimensional vector. The max_len in size(pe), on the other hand, belongs to the positional embedding and leaves some headroom, so pe is not added to x directly; instead we slice out the part we need, namely pe[:, :x.size(1)].
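For completeness, here is a sketch of the full PositionalEncoding class, following The Annotated Transformer: the pe table is precomputed once in __init__ and registered as a non-trainable buffer, and the forward pass is exactly the one shown above.

```python
import math
import torch
import torch.nn as nn
from torch.autograd import Variable

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)
        # pe: (max_len, d_model), filled with sin on even dims and cos on odd dims
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0., max_len).unsqueeze(1)            # (max_len, 1)
        div_term = torch.exp(torch.arange(0., d_model, 2) *
                             -(math.log(10000.0) / d_model))         # 1 / 10000^{2i/d_model}
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)                                         # (1, max_len, d_model)
        self.register_buffer('pe', pe)                               # not a trainable parameter

    def forward(self, x):
        # slice out the first x.size(1) positions and add them to the embeddings
        x = x + Variable(self.pe[:, :x.size(1)], requires_grad=False)
        return self.dropout(x)
```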
### Implementing SublayerConnection
Let's look at how Section 3.1 of the paper describes the Encoder:

> We employ a residual connection around each of the two sub-layers, followed by layer normalization. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)).
But honestly, I find LayerNorm(x + Sublayer(x)) too coarse: it hides what the computation actually looks like and does not help much when writing code. Let's first think about what a residual connection is: if the input to a network is x and we denote the network by sublayer, then a residual connection means y = x + sublayer(x).
So at first glance it looks like all we have to do is pass the sublayer in as an argument when defining SublayerConnection:

```python
class SublayerConnection(nn.Module):
    # TODO
    def forward(self, x, sublayer):
        # TODO
        pass
```
However, we call sublayer(x) here, and (small spoiler) the sublayer is either self-attention or the position-wise feed forward network, both of which are parameter-heavy networks. For such networks, to keep training stable (normalization) and to prevent overfitting (regularization), we apply the following tricks:

- Layer Normalization: to deal with the internal covariate shift problem we apply Layer Normalization. If you are interested in normalization in general, see this post on Normalization. We normally apply it to the input, i.e. we layer-normalize x.
- Regularization: to prevent overfitting, our regularization scheme is dropout, applied to the output: dropout(sublayer(norm(x))).
Alright, time to implement the SublayerConnection class (rubbing hands in excitement):

```python
"""
input: x
SublayerConnection performs the following operations (in order):
1. Layer-normalize the input
2. Apply the sublayer function
3. Dropout: see Section 5.4 Regularization, Residual Dropout:
   <We apply dropout to the output of each sub-layer, before it is added to the
   sub-layer input and normalized>
   The addition between the output of each sub-layer and the sub-layer input
   is what is known as the residual connection.
4. Residual connection
"""
class SublayerConnection(nn.Module):
    def __init__(self, size, dropout):
        super(SublayerConnection, self).__init__()
        self.norm = LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    # sublayer is a function defined by self attention or feed forward
    def forward(self, x, sublayer):
        # Normalization
        norm_x = self.norm(x)
        # Sublayer function
        sublayer_x = sublayer(norm_x)
        # Dropout on the sublayer output
        dropout_x = self.dropout(sublayer_x)
        # Residual connection
        return x + dropout_x
```
Note that we have introduced another new module, Layer Normalization. It is easy to implement:

```python
"""
LayerNorm implements Layer Normalization (LN), used around each of the two sublayers.
For more on Layer Normalization and normalization in general, see my blog.
"""
class LayerNorm(nn.Module):
    def __init__(self, features, eps=1e-6):
        super(LayerNorm, self).__init__()
        # a_2, b_2 are a trainable gain and bias used to rescale and shift
        self.a_2 = nn.Parameter(torch.ones(features))
        self.b_2 = nn.Parameter(torch.zeros(features))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.a_2 * (x - mean) / (std + self.eps) + self.b_2
```
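With both classes in place, here is a quick usage sketch (the shapes and the stand-in nn.Linear sublayer are made up for illustration only): SublayerConnection does not care what the sublayer computes; it just receives a callable and wraps the norm, dropout, and residual addition around it.

```python
import torch
import torch.nn as nn

d_model = 512
sublayer_connection = SublayerConnection(size=d_model, dropout=0.1)

# stand-in sublayer: any callable mapping (batch, max_len, d_model) -> same shape
toy_sublayer = nn.Linear(d_model, d_model)

x = torch.randn(2, 10, d_model)            # (batch_size, max_len, d_model)
y = sublayer_connection(x, toy_sublayer)   # x + dropout(toy_sublayer(norm(x)))
print(y.shape)                             # torch.Size([2, 10, 512])
```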
### Implementing the Multi-Head Attention module
Every time I see Google's description of attention I cannot resist typing it out; it is concise and goes straight to the point:

> An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
So let's implement the attention function first. Note that Google proposes a mechanism called Scaled Dot-Product Attention: \[ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}(\frac{QK^T}{\sqrt{d_k}})V \] Why divide by \(\sqrt{d_k}\)? The paper's explanation is:

> We suspect that for large values of \(d_k\), the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients. To counteract this effect, we scale the dot products by \(\frac{1}{\sqrt{d_k}}\).
In other words, the paper argues that when the dimension is high, every element of \(QK^T\) becomes large, so after the softmax we easily end up in saturated regions where the gradient is tiny, which slows training down. Now that the idea is clear, the implementation is straightforward. Still, once you have spent enough time training models, the form deserves a second look: what exactly does the simple \(QK^T\) mean?

```python
import math
import torch
import torch.nn.functional as F

def attention(query, key, value, mask=None, dropout=None):
    "Scaled Dot-Product Attention: softmax(QK^T / sqrt(d_k)) V"
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # positions where mask == 0 must not be attended to
        scores = scores.masked_fill(mask == 0, -1e9)
    p_attn = F.softmax(scores, dim=-1)
    if dropout is not None:
        p_attn = dropout(p_attn)
    return torch.matmul(p_attn, value), p_attn
```
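As a quick sanity check (a toy example with made-up shapes, not from the original code), the attention weights for each query position should sum to 1 over the key positions:

```python
import torch

# toy tensors: a batch of 2 sequences of length 5 with d_k = 64
q = torch.randn(2, 5, 64)
k = torch.randn(2, 5, 64)
v = torch.randn(2, 5, 64)

out, p_attn = attention(q, k, v)
print(out.shape)            # torch.Size([2, 5, 64])
print(p_attn.shape)         # torch.Size([2, 5, 5])
print(p_attn.sum(dim=-1))   # every row sums to 1 (softmax over the keys)
```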
Next comes the centerpiece of Attention is All You Need: Multi-Head Attention. What is it good for?

> Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.
In short, we split one big attention (of dimension \(d_{\text{model}}\)) into several lower-dimensional attentions. The formula first: \[ \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head_1}, ..., \mathrm{head_h})W^O \\ \text{where}~\mathrm{head_i} = \mathrm{Attention}(QW^Q_i, KW^K_i, VW^V_i) \] where the projections are parameter matrices \(W^Q_i \in \mathbb{R}^{d_{\text{model}} \times d_k}\), \(W^K_i \in \mathbb{R}^{d_{\text{model}} \times d_k}\), \(W^V_i \in \mathbb{R}^{d_{\text{model}} \times d_v}\) and \(W^O \in \mathbb{R}^{hd_v \times d_{\text{model}}}\).
The implementation is also quite direct:

```python
"""
h is the number of parallel attention layers (heads). In the paper, h = 8.
"""
class MultiHeadAttention(nn.Module):
    def __init__(self, h, d_model, dropout=0.1):
        super(MultiHeadAttention, self).__init__()
        assert d_model % h == 0
        # self.d_k is the reduced dimension of each parallel attention head
        self.d_k = d_model // h
        self.h = h
        # self.linears is a list of 4 projection layers
        # self.linears[0]: Concat(W^Q_i), where i \in [1,...,h].
        # self.linears[1]: Concat(W^K_i), where i \in [1,...,h].
        # self.linears[2]: Concat(W^V_i), where i \in [1,...,h].
        # self.linears[3]: W^O, the output projection.
        self.linears = clones(nn.Linear(d_model, d_model), 4)
        self.attn = None
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, query, key, value, mask=None):
        # query.size() = key.size() = value.size() = (batch_size, max_len, d_model)
        if mask is not None:
            # the same mask is applied to all h heads
            mask = mask.unsqueeze(1)
        batch_size = query.size(0)
        """
        Do all the linear projections; after this operation
        query.size() = key.size() = value.size() = (batch_size, self.h, max_len, self.d_k)
        """
        query, key, value = \
            [linear(x).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)
             for linear, x in zip(self.linears, (query, key, value))]
        """
        x.size(): (batch_size, h, max_len, d_v)
        self.attn.size(): (batch_size, h, max_len, max_len)
        """
        x, self.attn = attention(query, key, value, mask=mask, dropout=self.dropout)
        """
        x.transpose(1, 2).size(): (batch_size, max_len, h, d_v)
        the transpose operation is necessary before merging the heads;
        after the view, x.size(): (batch_size, max_len, h * d_v)
        """
        x = x.transpose(1, 2).contiguous().view(batch_size, -1, self.h * self.d_k)
        # self.linears[-1] is W^O \in R^{h*d_v x d_model}
        return self.linears[-1](x)
```
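One caveat: the clones helper used above (and again in the Encoder and Decoder below) is never defined in this post. In The Annotated Transformer it simply deep-copies a module N times into an nn.ModuleList, roughly:

```python
import copy
import torch.nn as nn

def clones(module, N):
    "Produce N identical layers (independent copies, not shared parameters)."
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])
```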
### Implementing the Position-wise Feed-Forward component
This one is quite simple; the formula says it all: \[ \mathrm{FFN}(x)=\max(0, xW_1 + b_1) W_2 + b_2 \]
```python
class PositionwiseFeedForward(nn.Module):
    ...
```
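A minimal sketch of the class, following the FFN formula above (the inner dimension d_ff and the dropout argument follow The Annotated Transformer; the paper uses d_ff = 2048):

```python
import torch.nn as nn
import torch.nn.functional as F

class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        super(PositionwiseFeedForward, self).__init__()
        self.w_1 = nn.Linear(d_model, d_ff)   # x W_1 + b_1
        self.w_2 = nn.Linear(d_ff, d_model)   # (...) W_2 + b_2
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # FFN(x) = max(0, x W_1 + b_1) W_2 + b_2, with dropout after the ReLU
        return self.w_2(self.dropout(F.relu(self.w_1(x))))
```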
### Implementing the Generator component
```python
class Generator(nn.Module):
    ...
```
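A minimal sketch of the class: a linear projection from d_model to the vocabulary size, followed by a (log-)softmax. The log_softmax variant below follows The Annotated Transformer, which pairs it with a KL/NLL-style loss; a plain softmax matches the "Linear + Softmax" description above.

```python
import torch.nn as nn
import torch.nn.functional as F

class Generator(nn.Module):
    def __init__(self, d_model, vocab):
        super(Generator, self).__init__()
        self.proj = nn.Linear(d_model, vocab)   # d_model -> vocabulary size

    def forward(self, x):
        # log-probabilities over the vocabulary for every position
        return F.log_softmax(self.proj(x), dim=-1)
```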
## Structural components
### PositionWise Feed Forward Network
See the functional-module introduction above.

### Encoder component

The Encoder is composed of N = 6 EncoderLayers. Without further ado, let's implement both classes:

```python
class Encoder(nn.Module):
    def __init__(self, layer, N):
        super(Encoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)

    def forward(self, x, mask):
        for layer in self.layers:
            x = layer(x, mask)
        memory = self.norm(x)
        return memory
```
Note: the memory we return still needs one final Layer Normalization after passing through all the EncoderLayers. This is because memory will serve as the Key and Value of the source-target attention layer, so that Layer Normalization is necessary.

```python
"""
EncoderLayer is the single repeated layer inside the Encoder;
each EncoderLayer is composed of two sublayers:
1. self-attention
2. feed forward
"""
class EncoderLayer(nn.Module):
    def __init__(self, size, self_attn, feed_forward, dropout):
        super(EncoderLayer, self).__init__()
        # self.self_attn is the Multi-Head Attention layer
        self.self_attn = self_attn
        self.feed_forward = feed_forward
        self.sublayerconnections = clones(SublayerConnection(size, dropout), 2)
        self.size = size

    def forward(self, x, mask):
        # self attention
        self_attention_x = self.sublayerconnections[0](x, lambda x: self.self_attn(x, x, x, mask))
        # feed forward
        feed_forward_x = self.sublayerconnections[1](self_attention_x, self.feed_forward)
        return feed_forward_x
```
### Decoder component
No complicated logic here, straight to the implementation:

```python
class Decoder(nn.Module):
    def __init__(self, layer, N):
        super(Decoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)

    def forward(self, x, memory, src_mask, tgt_mask):
        for layer in self.layers:
            x = layer(x, memory, src_mask, tgt_mask)
        return self.norm(x)
```
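The DecoderLayer that Decoder clones is not shown in this post. Here is a sketch following the same pattern as EncoderLayer (and The Annotated Transformer), with three sublayers: masked self-attention over the target, encoder-decoder attention over the memory, and the position-wise feed forward network:

```python
class DecoderLayer(nn.Module):
    def __init__(self, size, self_attn, src_attn, feed_forward, dropout):
        super(DecoderLayer, self).__init__()
        self.size = size
        self.self_attn = self_attn            # masked self-attention over the target
        self.src_attn = src_attn              # attention over the encoder memory
        self.feed_forward = feed_forward
        self.sublayerconnections = clones(SublayerConnection(size, dropout), 3)

    def forward(self, x, memory, src_mask, tgt_mask):
        m = memory
        # 1. masked self-attention: tgt_mask hides padding and future positions
        x = self.sublayerconnections[0](x, lambda x: self.self_attn(x, x, x, tgt_mask))
        # 2. encoder-decoder attention: queries from the decoder, keys/values from memory
        x = self.sublayerconnections[1](x, lambda x: self.src_attn(x, m, m, src_mask))
        # 3. position-wise feed forward
        return self.sublayerconnections[2](x, self.feed_forward)
```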
Likewise, because the Decoder's output is fed into the Generator, a final normalization is still required.

### Overall network

### Regularization tricks

- Residual Dropout
- Label Smoothing: add noise to the target y to avoid overfitting (see the sketch below)
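As a rough illustration of label smoothing (lightly adapted from The Annotated Transformer's LabelSmoothing; treat the exact signature as an assumption), the one-hot target is replaced by a smoothed distribution and compared to the model's log-probabilities with KL divergence:

```python
import torch
import torch.nn as nn

class LabelSmoothing(nn.Module):
    def __init__(self, size, padding_idx, smoothing=0.1):
        super(LabelSmoothing, self).__init__()
        self.criterion = nn.KLDivLoss(reduction='sum')
        self.padding_idx = padding_idx
        self.confidence = 1.0 - smoothing   # probability mass kept on the true token
        self.smoothing = smoothing          # mass spread over the remaining tokens
        self.size = size                    # vocabulary size

    def forward(self, x, target):
        # x: (n_tokens, vocab) log-probabilities; target: (n_tokens,) token ids
        assert x.size(1) == self.size
        true_dist = x.data.clone()
        true_dist.fill_(self.smoothing / (self.size - 2))
        true_dist.scatter_(1, target.data.unsqueeze(1), self.confidence)
        true_dist[:, self.padding_idx] = 0
        # zero out rows whose target is the padding token
        padding_rows = torch.nonzero(target.data == self.padding_idx).squeeze(-1)
        true_dist.index_fill_(0, padding_rows, 0.0)
        return self.criterion(x, true_dist)
```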