Implementation of transformer multi-head attention in different frameworks (PyTorch + TensorFlow)

Xixi Moyo 2020-11-19 03:20:14

Multi-head attention can be described by the diagram below. (Figure omitted.)

1、 Implementation with the PyTorch library

The parameters are as follows:

• embed_dim: the dimension of the K, Q and V matrices and of the final output. It must be the same as the word-vector dimension.

• num_heads: the number of attention heads. If set to 1, only a single set of attention is used. For any other value, embed_dim must be divisible by num_heads.

• dropout: this dropout is applied to the attention scores.

Now let's explain why embed_dim must be divisible by num_heads. The hidden vector of each word is split evenly among the heads, so that all groups of attention fit into a single matrix and the multiple heads can be computed in parallel.
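To make this concrete, here is a minimal sketch (NumPy, using the hypothetical sizes from this article: embed_dim = 300, num_heads = 6) of how one sentence's hidden vectors are split evenly across the heads:

```python
import numpy as np

embed_dim, num_heads = 300, 6
head_dim = embed_dim // num_heads  # 50 -- this is why divisibility is required

# One sentence of 10 words, each word a 300-dimensional hidden vector
x = np.random.rand(10, embed_dim)

# Split every 300-dimensional vector into 6 groups of 50, so all heads
# live in one tensor and can be processed in parallel
heads = x.reshape(10, num_heads, head_dim)
print(heads.shape)  # (10, 6, 50)
```

If embed_dim were not divisible by num_heads, this reshape would fail, which is exactly the constraint described above.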

After a MultiheadAttention object is defined, the arguments passed when calling it are as follows.

• query: corresponds to the Query matrix, with shape (L, N, E), where L is the output (target) sequence length, N is the batch size, and E is the word-vector dimension.

• key: corresponds to the Key matrix, with shape (S, N, E), where S is the input sequence length, N is the batch size, and E is the word-vector dimension.

• value: corresponds to the Value matrix, with shape (S, N, E), where S is the input sequence length, N is the batch size, and E is the word-vector dimension.

• key_padding_mask: if provided, the padding elements of the Key matrix are ignored when computing the attention scores and do not take part in the attention. The shape is (N, S), where N is the batch size and S is the input sequence length.

• If key_padding_mask is a ByteTensor, positions with non-zero values are ignored.
• If key_padding_mask is a BoolTensor, positions where the value is True are ignored.
• attn_mask: positions to ignore when computing the output. The shape can be 2-D, (L, S), or 3-D, (N * num_heads, L, S), where L is the output sequence length, S is the input sequence length, and N is the batch size.

• If attn_mask is a ByteTensor, positions with non-zero values are ignored.
• If attn_mask is a BoolTensor, positions where the value is True are ignored.
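As a sketch of how the two masks are passed in (all sizes here are made up for illustration: batch of 2, input length 4, output length 3, embed_dim 8, 2 heads):

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=8, num_heads=2)
query = torch.rand(3, 2, 8)   # (L, N, E)
key = torch.rand(4, 2, 8)     # (S, N, E)
value = torch.rand(4, 2, 8)   # (S, N, E)

# key_padding_mask: shape (N, S); True marks padding positions to ignore
key_padding_mask = torch.tensor([[False, False, False, True],
                                 [False, False, True, True]])

# attn_mask: shape (L, S); True marks pairs the output may NOT attend to
# (here a causal-style upper-triangular mask)
attn_mask = torch.triu(torch.ones(3, 4, dtype=torch.bool), diagonal=1)

out, weights = mha(query, key, value,
                   key_padding_mask=key_padding_mask,
                   attn_mask=attn_mask)
print(out.shape)      # torch.Size([3, 2, 8])
print(weights.shape)  # torch.Size([2, 3, 4]), averaged over heads
```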

Note: in practice, the K and V matrices always have the same sequence length, while the sequence length of the Q matrix can be different.

This happens in the Encoder-Decoder Attention layer of the decoder, where the Q matrix comes from the lower layers of the decoder, while the K and V matrices come from the output of the encoder.

Code example:

## nn.MultiheadAttention expects the sequence length in dimension 0
import torch
import torch.nn as nn

# batch_size is 64, there are 12 words, each Query vector is 300-dimensional
query = torch.rand(12, 64, 300)
# batch_size is 64, there are 10 words, each Key vector is 300-dimensional
key = torch.rand(10, 64, 300)
# batch_size is 64, there are 10 words, each Value vector is 300-dimensional
value = torch.rand(10, 64, 300)

embed_dim = 300  # must equal the word-vector dimension
num_heads = 6    # embed_dim must be divisible by num_heads
multihead_attn = nn.MultiheadAttention(embed_dim, num_heads)
# The output is (attn_output, attn_output_weights)
attn_output, attn_output_weights = multihead_attn(query, key, value)
# output: torch.Size([12, 64, 300])
# batch_size is 64, 12 words, each word's vector is 300-dimensional
print(attn_output.shape)

2、 Computing multi-head attention by hand

In PyTorch's MultiheadAttention, dimension 0 is the sentence length and dimension 1 is the batch size. In the implementation below, dimension 0 is the batch size and dimension 1 is the sentence length. The code also shows how to compute multiple groups of attention in parallel with matrix operations; detailed comments are included.
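Converting between the two conventions is a single transpose; a quick sketch:

```python
import torch

# nn.MultiheadAttention (without batch_first) expects (seq_len, batch, embed);
# the manual implementation below uses (batch, seq_len, embed)
x_seq_first = torch.rand(12, 64, 300)        # (L, N, E)
x_batch_first = x_seq_first.transpose(0, 1)  # (N, L, E)
print(x_batch_first.shape)  # torch.Size([64, 12, 300])
```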

import torch
import torch.nn as nn

class MultiheadAttention(nn.Module):
    # n_heads: the number of attention heads
    # hid_dim: the vector dimension of each word's output
    def __init__(self, hid_dim, n_heads, dropout=0.1):
        super().__init__()
        self.hid_dim = hid_dim
        self.n_heads = n_heads
        # hid_dim must be divisible by n_heads
        assert hid_dim % n_heads == 0
        # define the W_q matrix
        self.w_q = nn.Linear(hid_dim, hid_dim)
        # define the W_k matrix
        self.w_k = nn.Linear(hid_dim, hid_dim)
        # define the W_v matrix
        self.w_v = nn.Linear(hid_dim, hid_dim)
        self.fc = nn.Linear(hid_dim, hid_dim)
        self.do = nn.Dropout(dropout)
        # the scaling factor: square root of the per-head dimension
        self.scale = torch.sqrt(torch.FloatTensor([hid_dim // n_heads]))

    def forward(self, query, key, value, mask=None):
        # Q: [64,12,300], batch_size 64, 12 words, each Query vector 300-dim
        # K: [64,10,300], batch_size 64, 10 words, each Key vector 300-dim
        # V: [64,10,300], batch_size 64, 10 words, each Value vector 300-dim
        bsz = query.shape[0]
        Q = self.w_q(query)
        K = self.w_k(key)
        V = self.w_v(value)
        # Split the K, Q, V matrices into groups of attention, turning them into 4-D tensors
        # The last dimension is self.hid_dim // self.n_heads, the vector length of each
        # group of attention; each head's vector length is 300/6 = 50
        # 64 is the batch size, 6 the number of heads, 10 the number of words,
        # 50 the per-head vector length
        # K: [64,10,300] split into heads -> [64,10,6,50] transpose -> [64,6,10,50]
        # V: [64,10,300] split into heads -> [64,10,6,50] transpose -> [64,6,10,50]
        # Q: [64,12,300] split into heads -> [64,12,6,50] transpose -> [64,6,12,50]
        # The transpose moves the head dimension (6) forward and puts 10 and 50
        # at the back, which makes the computation convenient
        Q = Q.view(bsz, -1, self.n_heads, self.hid_dim // self.n_heads).permute(0, 2, 1, 3)
        K = K.view(bsz, -1, self.n_heads, self.hid_dim // self.n_heads).permute(0, 2, 1, 3)
        V = V.view(bsz, -1, self.n_heads, self.hid_dim // self.n_heads).permute(0, 2, 1, 3)
        # Step 1: multiply Q by the transpose of K, then divide by the scale
        # [64,6,12,50] * [64,6,50,10] = [64,6,12,10]
        # attention: [64,6,12,10]
        attention = torch.matmul(Q, K.permute(0, 1, 3, 2)) / self.scale
        # If a mask is given, set the attention score to -1e10 wherever the mask is 0
        if mask is not None:
            attention = attention.masked_fill(mask == 0, -1e10)
        # Step 2: apply softmax to the result of the previous step, then dropout,
        # to obtain the attention weights.
        # Note that the softmax is over the last dimension, i.e. over the
        # input-sequence dimension
        # attention: [64,6,12,10]
        attention = self.do(torch.softmax(attention, dim=-1))
        # Step 3: multiply the attention weights by V to get the multi-head result
        # [64,6,12,10] * [64,6,10,50] = [64,6,12,50]
        # x: [64,6,12,50]
        x = torch.matmul(attention, V)
        # Because query has 12 words, move the 12 forward and put 6 and 50 at the
        # back, to make it easy to concatenate the groups of results below
        # x: [64,6,12,50] transpose -> [64,12,6,50]
        x = x.permute(0, 2, 1, 3).contiguous()
        # This reshape concatenates the results of the multiple heads
        # x: [64,12,6,50] -> [64,12,300]
        x = x.view(bsz, -1, self.n_heads * (self.hid_dim // self.n_heads))
        x = self.fc(x)
        return x

attention = MultiheadAttention(hid_dim=300, n_heads=6, dropout=0.1)
# batch_size is 64, 12 words, each Query vector is 300-dimensional
query = torch.rand(64, 12, 300)
# batch_size is 64, 10 words, each Key vector is 300-dimensional
key = torch.rand(64, 10, 300)
# batch_size is 64, 10 words, each Value vector is 300-dimensional
value = torch.rand(64, 10, 300)
output = attention(query, key, value)
## output: torch.Size([64, 12, 300])
print(output.shape)
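The three steps can also be checked in isolation. This is a minimal self-contained sketch, using the shapes from the comments above (batch 64, 6 heads, 12 query words, 10 key/value words, 50 dimensions per head):

```python
import torch

Q = torch.rand(64, 6, 12, 50)
K = torch.rand(64, 6, 10, 50)
V = torch.rand(64, 6, 10, 50)
scale = 50 ** 0.5

# Step 1: scaled dot-product scores
scores = torch.matmul(Q, K.permute(0, 1, 3, 2)) / scale  # [64, 6, 12, 10]
# Step 2: softmax over the input-sequence dimension
weights = torch.softmax(scores, dim=-1)
# Each row of the weights sums to 1
print(weights.sum(dim=-1).allclose(torch.ones(64, 6, 12)))  # True
# Step 3: weighted sum of V
out = torch.matmul(weights, V)
print(out.shape)  # torch.Size([64, 6, 12, 50])
```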

3、 Multi-head attention in TensorFlow

# rawKeys is used when computing the mask: keys already has the position
# embedding added, so its padding positions are no longer all-zero, while
# rawKeys still has value 0 at the padding positions
# numUnits = 50
# (function wrapper reconstructed; in the original this is a class method)
def multiheadAttention(rawKeys, queries, keys, numUnits=None,
                       numHeads=6, keepProb=1.0, causality=False):
    if numUnits is None:  # if no value is given, use the last dimension of the data, i.e. the embedding size
        numUnits = queries.get_shape().as_list()[-1]  # 300
    # tf.layers.dense can apply a non-linear mapping to a multi-dimensional tensor.
    # When computing self-attention it is important to map these three values non-linearly.
    # This is the weight-mapping step applied to the split data in the paper's
    # Multi-Head Attention; here we map first and split afterwards, which is
    # equivalent in principle.
    # Q, K, V all have shape [batch_size, sequence_length, embedding_size]
    Q = tf.layers.dense(queries, numUnits, activation=tf.nn.relu)  # [64,10,300]
    K = tf.layers.dense(keys, numUnits, activation=tf.nn.relu)     # [64,10,300]
    V = tf.layers.dense(keys, numUnits, activation=tf.nn.relu)     # [64,10,300]
    # Split the last dimension into numHeads pieces and stack them along the first dimension
    # Q_, K_, V_ all have shape [batch_size * numHeads, sequence_length, embedding_size / numHeads]
    Q_ = tf.concat(tf.split(Q, numHeads, axis=-1), axis=0)  # [64*6,10,50]
    K_ = tf.concat(tf.split(K, numHeads, axis=-1), axis=0)  # [64*6,10,50]
    V_ = tf.concat(tf.split(V, numHeads, axis=-1), axis=0)  # [64*6,10,50]
    # Dot product between queries and keys, shape [batch_size * numHeads, queries_len, keys_len];
    # the last two dimensions are the sequence lengths of queries and keys
    similary = tf.matmul(Q_, tf.transpose(K_, [0, 2, 1]))  # [64*6,10,10]
    # Scale the dot product by the square root of the vector length
    scaledSimilary = similary / (K_.get_shape().as_list()[-1] ** 0.5)  # [64*6,10,10]
    # The input sequence contains padding tokens that should not contribute to the
    # result. In principle an all-zero padding input yields a weight of 0, but the
    # transformer adds a position vector, after which the padding values are no longer
    # 0, so the mask must be built from rawKeys, before the position vector is added.
    # queries also contains padding tokens, but in self-attention queries == keys, so
    # as long as one side is 0 the computed weight is 0.
    # tf.sign maps values < 0 to -1, values > 0 to 1 and values == 0 to 0,
    # which locates the padding positions
    # rawKeys: [64,10,300]
    keyMasks = tf.sign(tf.abs(tf.reduce_sum(rawKeys, axis=-1)))  # [batch_size, time_step] = [64,10]
    # Use tf.tile to expand to [batch_size * numHeads, keys_len]
    keyMasks = tf.tile(keyMasks, [numHeads, 1])  # [64*6,10]
    # Add a dimension and expand to [batch_size * numHeads, queries_len, keys_len]
    keyMasks = tf.tile(tf.expand_dims(keyMasks, 1), [1, tf.shape(queries)[1], 1])  # [64*6,10,10]
    # tf.ones_like builds a tensor of ones with the same shape as scaledSimilary,
    # which is then scaled to a very large negative value
    paddings = tf.ones_like(scaledSimilary) * (-2 ** 32 + 1)  # [64*6,10,10]
    # tf.where(condition, x, y): where condition is True take the element from x,
    # where it is False take the element from y; condition, x and y have the same
    # shape. Here, wherever keyMasks is 0 the score is replaced by paddings
    maskedSimilary = tf.where(tf.equal(keyMasks, 0), paddings, scaledSimilary)  # [64*6,10,10]
    # When computing the current word, consider only the preceding words, not the
    # following ones. This appears in the Transformer decoder; for text classification
    # the Transformer encoder alone can be used. The decoder is a generative model,
    # mainly used for language generation.
    if causality:
        diagVals = tf.ones_like(maskedSimilary[0, :, :])  # [queries_len, keys_len]
        tril = tf.contrib.linalg.LinearOperatorTriL(diagVals).to_dense()  # lower triangular, [queries_len, keys_len]
        masks = tf.tile(tf.expand_dims(tril, 0), [tf.shape(maskedSimilary)[0], 1, 1])
        paddings = tf.ones_like(masks) * (-2 ** 32 + 1)
        maskedSimilary = tf.where(tf.equal(masks, 0), paddings, maskedSimilary)
    # Compute the weight coefficients with softmax, shape [batch_size * numHeads, queries_len, keys_len]
    weights = tf.nn.softmax(maskedSimilary)
    # Weighted sum to get the output, shape [batch_size * numHeads, sequence_length, embedding_size / numHeads]
    outputs = tf.matmul(weights, V_)
    # Reassemble the multi-head attention output into the original shape
    # [batch_size, sequence_length, embedding_size]
    outputs = tf.concat(tf.split(outputs, numHeads, axis=0), axis=2)
    outputs = tf.nn.dropout(outputs, keep_prob=keepProb)
    # Residual connection for each sub-layer, i.e. H(x) = F(x) + x
    outputs += queries
    # normalization layer
    # outputs = self._layerNormalization(outputs)
    return outputs
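The padding-mask trick (sign of the absolute row sums of rawKeys) can be illustrated with a small NumPy sketch, using made-up sizes (batch 2, sequence length 4, embedding 3):

```python
import numpy as np

rawKeys = np.ones((2, 4, 3))  # batch 2, seq len 4, embed 3
rawKeys[0, 3] = 0.0           # last word of sentence 0 is padding
rawKeys[1, 2:] = 0.0          # last two words of sentence 1 are padding

# Padded words are all-zero vectors, so the sum over the embedding axis is 0
# there; sign(abs(...)) therefore yields 0 at padding positions, 1 elsewhere
keyMasks = np.sign(np.abs(rawKeys.sum(axis=-1)))
print(keyMasks)
# [[1. 1. 1. 0.]
#  [1. 1. 0. 0.]]
```

This is why the mask must be computed from rawKeys rather than from keys: once the position embedding is added, the padding rows are no longer zero and the trick stops working.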

The input is self.embeddedWords = self.wordEmbedded + self.positionEmbedded, i.e. word embedding plus position embedding.

Using the same dimensions as the PyTorch example: self.wordEmbedded has shape [64,10,300] and self.positionEmbedded has shape [64,10,300].

When it is used, the call looks like:

keys=self.embeddedWords)

For example (with simplified inputs):

import numpy as np
import tensorflow as tf

wordEmbedded = tf.Variable(np.ones((64, 10, 300)))
positionEmbedded = tf.Variable(np.ones((64, 10, 300)))
embeddedWords = wordEmbedded + positionEmbedded