Implementing Transformer multi-head attention in different frameworks (PyTorch + TensorFlow)


Multi-head attention can be summarized as follows: the Query, Key, and Value projections are split into several heads, each head computes scaled dot-product attention, and the head outputs are concatenated.
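In formula form, each head computes scaled dot-product attention, and the head outputs are concatenated and projected (standard notation from the Transformer paper):

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
\mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q},\; K W_i^{K},\; V W_i^{V})
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O}
```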

1、Using the implementation in the PyTorch library

The constructor arguments are as follows:

• embed_dim: the output dimension of the K, Q, and V matrices. It must match the dimension of the word vectors.

• num_heads: the number of attention heads. If it is set to 1, a single set of attention is used; for any other value, embed_dim must be divisible by num_heads.

• dropout: this dropout is applied to the attention scores.

Now let's explain why embed_dim must be divisible by num_heads. The hidden vector of each word is split evenly across the heads, so all heads can be packed into a single matrix and their attention computed in parallel.
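As a minimal sketch of this split, assume embed_dim = 300 and num_heads = 6 (the example dimensions used later in this post): each head gets 300 / 6 = 50 dimensions, and one reshape plus one transpose packs all heads into a single tensor:

```python
import torch

embed_dim, num_heads = 300, 6
head_dim = embed_dim // num_heads  # 50; this only works because 6 divides 300

x = torch.rand(64, 10, embed_dim)            # [batch, seq_len, embed_dim]
# One reshape splits each word's hidden vector evenly across the heads...
heads = x.view(64, 10, num_heads, head_dim)  # [batch, seq_len, heads, head_dim]
# ...and one transpose lets a single batched matmul serve all heads at once
heads = heads.transpose(1, 2)                # [batch, heads, seq_len, head_dim]
print(heads.shape)  # torch.Size([64, 6, 10, 50])
```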

After constructing a MultiheadAttention object, the arguments passed when calling it are as follows.

• query: corresponds to the Query matrix. The shape is (L, N, E), where L is the length of the target sequence, N is the batch size, and E is the dimension of a word vector.

• key: corresponds to the Key matrix. The shape is (S, N, E), where S is the length of the input sequence, N is the batch size, and E is the dimension of a word vector.

• value: corresponds to the Value matrix. The shape is (S, N, E), where S is the length of the input sequence, N is the batch size, and E is the dimension of a word vector.

• key_padding_mask: if this argument is provided, the padding elements of the Key matrix are ignored when computing the attention scores and do not take part in the attention. The shape is (N, S), where N is the batch size and S is the length of the input sequence.

• If key_padding_mask is a ByteTensor, the non-zero positions are ignored
• If key_padding_mask is a BoolTensor, the positions that are True are ignored

• attn_mask: positions to ignore when computing the output. The shape can be 2-D, (L, S), or 3-D, (N * num_heads, L, S), where L is the length of the target sequence, S is the length of the input sequence, and N is the batch size.

• If attn_mask is a ByteTensor, the non-zero positions are ignored
• If attn_mask is a BoolTensor, the positions that are True are ignored
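A minimal sketch of both mask arguments in use (the tensors are random placeholders, and treating the last two key positions as padding is an assumption made purely for illustration):

```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=300, num_heads=6)
query = torch.rand(12, 64, 300)  # (L, N, E)
key = torch.rand(10, 64, 300)    # (S, N, E)
value = torch.rand(10, 64, 300)  # (S, N, E)

# key_padding_mask: (N, S); True positions are ignored (BoolTensor)
key_padding_mask = torch.zeros(64, 10, dtype=torch.bool)
key_padding_mask[:, -2:] = True  # pretend the last two tokens are padding

# attn_mask: (L, S); True positions are ignored (all False here = no masking)
attn_mask = torch.zeros(12, 10, dtype=torch.bool)

out, weights = attn(query, key, value,
                    key_padding_mask=key_padding_mask,
                    attn_mask=attn_mask)
print(out.shape)  # torch.Size([12, 64, 300])
# the masked key positions receive zero attention weight
print(weights[:, :, -2:].abs().max())
```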

Note: in practice, the K and V matrices always have the same sequence length, while the sequence length of the Q matrix may differ.

This occurs in the Encoder-Decoder Attention layer of the decoder: the Q matrix comes from the previous decoder layer, while the K and V matrices come from the encoder's output.

Code sample:

## In nn.MultiheadAttention, dimension 0 of the inputs is the sequence length
import torch
import torch.nn as nn

# batch_size is 64, 12 words, each word's Query vector is 300-dimensional
query = torch.rand(12, 64, 300)
# batch_size is 64, 10 words, each word's Key vector is 300-dimensional
key = torch.rand(10, 64, 300)
# batch_size is 64, 10 words, each word's Value vector is 300-dimensional
value = torch.rand(10, 64, 300)
embed_dim = 300  # must match the word-vector dimension
num_heads = 6    # 300 is divisible by 6
multihead_attn = nn.MultiheadAttention(embed_dim, num_heads)
# The output is (attn_output, attn_output_weights)
attn_output, attn_output_weights = multihead_attn(query, key, value)
# attn_output: torch.Size([12, 64, 300])
# batch_size is 64, 12 words, each word's vector is 300-dimensional
print(attn_output.shape)

2、Computing multi-head attention by hand

In the MultiheadAttention provided by PyTorch, dimension 0 is the sentence length and dimension 1 is the batch size. In our implementation below, dimension 0 is the batch size and dimension 1 is the sentence length. The code also shows how to compute several groups of attention in parallel with a single matrix; detailed comments and explanations are included.
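Since the two conventions differ only in the first two dimensions, a single transpose(0, 1) converts a batch-first tensor to the sequence-first layout that nn.MultiheadAttention expects (recent PyTorch versions also accept batch_first=True in the constructor). A minimal sketch:

```python
import torch

q_batch_first = torch.rand(64, 12, 300)      # [batch, seq_len, embed]
q_seq_first = q_batch_first.transpose(0, 1)  # [seq_len, batch, embed]
print(q_seq_first.shape)  # torch.Size([12, 64, 300])
```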

import torch
import torch.nn as nn

class MultiheadAttention(nn.Module):
    # n_heads: the number of attention heads
    # hid_dim: the dimension of each word's output vector
    def __init__(self, hid_dim, n_heads, dropout):
        super().__init__()
        self.hid_dim = hid_dim
        self.n_heads = n_heads
        # hid_dim must be divisible by n_heads
        assert hid_dim % n_heads == 0
        # Define the W_q matrix
        self.w_q = nn.Linear(hid_dim, hid_dim)
        # Define the W_k matrix
        self.w_k = nn.Linear(hid_dim, hid_dim)
        # Define the W_v matrix
        self.w_v = nn.Linear(hid_dim, hid_dim)
        self.fc = nn.Linear(hid_dim, hid_dim)
        self.do = nn.Dropout(dropout)
        # Scaling factor: the square root of each head's vector length
        self.scale = (hid_dim // n_heads) ** 0.5

    def forward(self, query, key, value, mask=None):
        # Q: [64,12,300], batch_size is 64, 12 words, 300-dimensional Query vectors
        # K: [64,10,300], batch_size is 64, 10 words, 300-dimensional Key vectors
        # V: [64,10,300], batch_size is 64, 10 words, 300-dimensional Value vectors
        bsz = query.shape[0]
        Q = self.w_q(query)
        K = self.w_k(key)
        V = self.w_v(value)
        # Split Q, K, and V into groups of attention, turning each into a
        # 4-dimensional tensor. The last dimension is
        # self.hid_dim // self.n_heads, the vector length of each head:
        # 300 / 6 = 50.
        # 64 is the batch size, 6 the number of heads, 10 (or 12) the number
        # of words, 50 the per-head vector length.
        # Q: [64,12,300] -> split heads -> [64,12,6,50] -> transpose -> [64,6,12,50]
        # K: [64,10,300] -> split heads -> [64,10,6,50] -> transpose -> [64,6,10,50]
        # V: [64,10,300] -> split heads -> [64,10,6,50] -> transpose -> [64,6,10,50]
        # The transpose moves the head dimension (6) forward and keeps the
        # word count and per-head length at the back, for convenient matmuls.
        Q = Q.view(bsz, -1, self.n_heads, self.hid_dim // self.n_heads).permute(0, 2, 1, 3)
        K = K.view(bsz, -1, self.n_heads, self.hid_dim // self.n_heads).permute(0, 2, 1, 3)
        V = V.view(bsz, -1, self.n_heads, self.hid_dim // self.n_heads).permute(0, 2, 1, 3)
        # Step 1: multiply Q by the transpose of K, then divide by the scale
        # [64,6,12,50] * [64,6,50,10] = [64,6,12,10]
        attention = torch.matmul(Q, K.permute(0, 1, 3, 2)) / self.scale
        # If a mask is given, set the attention score to -1e10 wherever the
        # mask is 0
        if mask is not None:
            attention = attention.masked_fill(mask == 0, -1e10)
        # Step 2: take the softmax of the previous result, then apply dropout.
        # Note the softmax is over the last dimension, i.e. over the input
        # (key) sequence.
        # attention: [64,6,12,10]
        attention = self.do(torch.softmax(attention, dim=-1))
        # Step 3: multiply the attention weights by V to get the multi-head
        # attention result
        # [64,6,12,10] * [64,6,10,50] = [64,6,12,50]
        x = torch.matmul(attention, V)
        # Because query has 12 words, move the 12 forward and put the 6 and 50
        # at the back, to make concatenating the heads easy
        # x: [64,6,12,50] -> transpose -> [64,12,6,50]
        x = x.permute(0, 2, 1, 3).contiguous()
        # This reshape concatenates the results of the heads back together
        # x: [64,12,6,50] -> [64,12,300]
        x = x.view(bsz, -1, self.hid_dim)
        x = self.fc(x)
        return x
# batch_size is 64, 12 words, each word's Query vector is 300-dimensional
query = torch.rand(64, 12, 300)
# batch_size is 64, 10 words, each word's Key vector is 300-dimensional
key = torch.rand(64, 10, 300)
# batch_size is 64, 10 words, each word's Value vector is 300-dimensional
value = torch.rand(64, 10, 300)
attention = MultiheadAttention(hid_dim=300, n_heads=6, dropout=0.1)
output = attention(query, key, value)
## output: torch.Size([64, 12, 300])
print(output.shape)
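The three steps of the forward pass can also be checked directly with plain tensor operations, using the same per-head shapes as in the comments above (random illustrative data):

```python
import torch

# One batch of per-head Q, K, V with the shapes from the comments above
Q = torch.rand(64, 6, 12, 50)
K = torch.rand(64, 6, 10, 50)
V = torch.rand(64, 6, 10, 50)
scale = 50 ** 0.5  # square root of the per-head vector length

# Step 1: Q times K-transposed, divided by the scale -> [64, 6, 12, 10]
scores = torch.matmul(Q, K.transpose(-2, -1)) / scale
# Step 2: softmax over the key dimension; each row of weights sums to 1
attn = torch.softmax(scores, dim=-1)
# Step 3: weighted sum of V -> [64, 6, 12, 50]
out = torch.matmul(attn, V)
print(out.shape)  # torch.Size([64, 6, 12, 50])
```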

3、Multi-head attention in TensorFlow

def multiheadAttention(rawKeys, queries, keys, numUnits=None, numHeads=6,
                       keepProb=1.0, causality=False):
    # rawKeys is only used to compute the padding mask. keys already has the
    # position embedding added, so its padding positions are no longer 0.
    if numUnits is None:
        # If no value is given, use the last dimension of the data,
        # i.e. the embedding size.
        numUnits = queries.get_shape().as_list()[-1]  # 300
    # tf.layers.dense can apply a nonlinear map to a multi-dimensional tensor.
    # When computing self-attention, the three values are each mapped first.
    # This is the weight-projection step from the paper's Multi-Head
    # Attention; mapping first and splitting afterwards is equivalent in
    # principle to splitting first and mapping per head.
    # Q, K, V: [batch_size, sequence_length, embedding_size]
    Q = tf.layers.dense(queries, numUnits, activation=tf.nn.relu)  # [64,10,300]
    K = tf.layers.dense(keys, numUnits, activation=tf.nn.relu)  # [64,10,300]
    V = tf.layers.dense(keys, numUnits, activation=tf.nn.relu)  # [64,10,300]
    # Split into numHeads pieces and stack them along the batch dimension.
    # Q_, K_, V_: [batch_size * numHeads, sequence_length, embedding_size / numHeads]
    Q_ = tf.concat(tf.split(Q, numHeads, axis=-1), axis=0)  # [64*6,10,50]
    K_ = tf.concat(tf.split(K, numHeads, axis=-1), axis=0)  # [64*6,10,50]
    V_ = tf.concat(tf.split(V, numHeads, axis=-1), axis=0)  # [64*6,10,50]
    # Dot product between queries and keys; the last two dimensions are the
    # sequence lengths of queries and keys:
    # [batch_size * numHeads, queries_len, keys_len]
    similary = tf.matmul(Q_, tf.transpose(K_, [0, 2, 1]))  # [64*6,10,10]
    # Scale the dot product by the square root of the vector length
    scaledSimilary = similary / (K_.get_shape().as_list()[-1] ** 0.5)  # [64*6,10,10]
    # The input sequence contains padding tokens, which should not influence
    # the result. In principle, if the padding input were all 0, its computed
    # weight would also be 0. But the transformer adds a position vector, so
    # after that addition the values are no longer 0; we therefore have to
    # mask those positions to 0 ourselves. queries also contains padding
    # tokens, but since queries = keys in self-attention, masking one side is
    # enough: if either factor is 0, the computed weight is 0.
    # tf.sign maps values < 0 to -1, > 0 to 1, and == 0 to 0, which locates
    # the padding positions.
    # rawKeys: [64,10,300]
    keyMasks = tf.sign(tf.abs(tf.reduce_sum(rawKeys, axis=-1)))  # [batch_size, keys_len] = [64,10]
    # Use tf.tile to expand for the heads: [batch_size * numHeads, keys_len]
    keyMasks = tf.tile(keyMasks, [numHeads, 1])  # [64*6,10]
    # Add a dimension and expand:
    # [batch_size * numHeads, queries_len, keys_len]
    keyMasks = tf.tile(tf.expand_dims(keyMasks, 1), [1, tf.shape(queries)[1], 1])  # [64*6,10,10]
    # tf.ones_like builds a tensor of ones with the same shape as
    # scaledSimilary; scaling it gives a very large negative value
    paddings = tf.ones_like(scaledSimilary) * (-2 ** 32 + 1)  # [64*6,10,10]
    # tf.where(condition, x, y): where condition is True take the element from
    # x, where False take it from y, so condition, x, and y must have the same
    # shape. Here, positions where keyMasks is 0 are replaced with the values
    # from paddings.
    maskedSimilary = tf.where(tf.equal(keyMasks, 0), paddings, scaledSimilary)
    # Causal masking: when computing the current word, only look at the words
    # before it, never after. This appears in the Transformer decoder; for
    # text classification, only the Transformer encoder is needed. The decoder
    # is a generative model, mainly used for language generation.
    if causality:
        diagVals = tf.ones_like(maskedSimilary[0, :, :])  # [queries_len, keys_len]
        tril = tf.contrib.linalg.LinearOperatorTriL(diagVals).to_dense()  # lower triangle
        masks = tf.tile(tf.expand_dims(tril, 0),
                        [tf.shape(maskedSimilary)[0], 1, 1])
        paddings = tf.ones_like(masks) * (-2 ** 32 + 1)
        maskedSimilary = tf.where(tf.equal(masks, 0), paddings, maskedSimilary)
    # Softmax gives the weight coefficients:
    # [batch_size * numHeads, queries_len, keys_len]
    weights = tf.nn.softmax(maskedSimilary)
    # Weighted sum gives the output:
    # [batch_size * numHeads, sequence_length, embedding_size / numHeads]
    outputs = tf.matmul(weights, V_)
    # Reassemble the multi-head output into the original dimensions:
    # [batch_size, sequence_length, embedding_size]
    outputs = tf.concat(tf.split(outputs, numHeads, axis=0), axis=2)
    outputs = tf.nn.dropout(outputs, keep_prob=keepProb)
    # Residual connection for each sub-layer: H(x) = F(x) + x
    outputs += queries
    # normalization layer
    # outputs = self._layerNormalization(outputs)
    return outputs

The input is self.embeddedWords = self.wordEmbedded + self.positionEmbedded, i.e. word embedding plus position embedding.

Using the same dimensions as the PyTorch example: self.wordEmbedded has shape [64,10,300], and self.positionEmbedded has shape [64,10,300].

A call then looks like:

multiheadAttention(rawKeys=self.wordEmbedded, queries=self.embeddedWords,
                   keys=self.embeddedWords)

For example (with simplified inputs):

import numpy as np
import tensorflow as tf

wordEmbedded = tf.Variable(np.ones((64, 10, 300)))
positionEmbedded = tf.Variable(np.ones((64, 10, 300)))
embeddedWords = wordEmbedded + positionEmbedded
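The padding-mask trick used above, tf.sign(tf.abs(tf.reduce_sum(rawKeys, axis=-1))), can be verified with plain NumPy: all-zero (padding) rows map to 0 and real tokens map to 1 (small illustrative data, with the last two positions treated as padding):

```python
import numpy as np

# rawKeys: [batch, seq_len, embed]; padding tokens are all-zero vectors
rawKeys = np.ones((2, 5, 4))
rawKeys[:, -2:, :] = 0.0  # the last two positions are padding

# sign(abs(sum over embed)) -> 1 for real tokens, 0 for padding
keyMasks = np.sign(np.abs(rawKeys.sum(axis=-1)))  # [batch, seq_len]
print(keyMasks)
# [[1. 1. 1. 0. 0.]
#  [1. 1. 1. 0. 0.]]
```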