The Transformer model consists of an encoding component and a decoding component
Encoding component is a stack of six encoders
Decoding component is a stack of six decoders
All encoders have an identical structure
but they do NOT share the same weights
Each encoder has two layers within it.
Self attention layer --> feed forward network.
Self attention layer helps the encoder focus on other words in the sentence while encoding the current word
Each decoder has three layers.
Self attention --> encoder-decoder attention --> feed forward network
Encoder-decoder attention helps the decoder focus on the relevant parts of the input sequence while producing the next word
![[transformer.png]]
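a rough runnable sketch of just this stacking order; the three layer types are stand-in placeholders here and are worked out properly in the later sections:

```python
import numpy as np

# rough sketch of the stacking order only; the three layer types here are
# stand-in placeholders and are covered in detail later in these notes
def self_attention(x):
    return x                                    # placeholder

def encoder_decoder_attention(y, enc_out):
    return y + enc_out.mean(axis=0)             # placeholder

def feed_forward(x):
    return np.maximum(0.0, x)                   # placeholder

def encoder_block(x):                           # self-attention -> feed-forward
    return feed_forward(self_attention(x))

def decoder_block(y, enc_out):                  # self-attention -> enc-dec attention -> feed-forward
    return feed_forward(encoder_decoder_attention(self_attention(y), enc_out))

src = np.random.randn(4, 512)                   # 4 source-word embedding vectors
tgt = np.random.randn(3, 512)                   # 3 target-word embedding vectors
for _ in range(6):                              # encoding component: stack of six encoders
    src = encoder_block(src)                    # (identical structure, no shared weights)
for _ in range(6):                              # decoding component: stack of six decoders
    tgt = decoder_block(tgt, src)
```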
High-level encoder
Each word is embedded into a 512 dimensional vector
Embedding only happens at the bottom most encoder
Each encoder receives list of vectors of size 512
For the bottom most encoder, we give a list of embedding vectors. This list of embedding vectors represents a sentence with words represented as vectors.
Size of the list is dependent on the longest sentence in the training corpus.
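a toy sketch of what the bottom-most encoder receives; the vocabulary and embedding table below are made up purely for illustration:

```python
import numpy as np

# toy vocabulary and random embedding table, just to illustrate the shapes:
# each word id is looked up in a (vocab_size x 512) table to get its embedding vector
vocab = {"plane": 0, "crashed": 1, "into": 2, "the": 3, "sea": 4}
d_model = 512
embedding_table = np.random.randn(len(vocab), d_model) * 0.01

sentence = ["plane", "crashed", "into", "the", "sea"]
X = np.stack([embedding_table[vocab[w]] for w in sentence])   # shape (5, 512)

# X is the list of embedding vectors fed to the bottom-most encoder;
# every encoder above it receives the previous encoder's output instead
print(X.shape)   # (5, 512)
```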
In Encoder:
the self-attention layer produces the same number of output vectors (z1,z2,...,zn) as the number of input vectors (x1,x2,...,xn) we feed it
this output is then fed through the feed forward network to produce (r1,r2,...,rn)
Each word flows through its own path in the encoder
in the self attention layer these paths depend on each other (each output vector depends on all the input words), so they cannot be computed independently of one another
but in the feed forward network the paths are independent, so parallel processing of these words is possible (see the sketch below)
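a minimal sketch of the position-wise feed forward network; the inner size of 2048 follows the original paper and the weights here are random placeholders:

```python
import numpy as np

# sketch of the position-wise feed-forward network; the inner size of 2048 is
# taken from the original paper and the weights here are random placeholders
d_model, d_ff = 512, 2048
W1, b1 = np.random.randn(d_model, d_ff) * 0.01, np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model) * 0.01, np.zeros(d_model)

def feed_forward(Z):
    # applied to each row (each word's vector) independently, which is why
    # this part of the encoder can process all words in parallel
    return np.maximum(0.0, Z @ W1 + b1) @ W2 + b2

Z = np.random.randn(4, d_model)   # (z1, ..., z4) from the self-attention layer
R = feed_forward(Z)               # (r1, ..., r4): same number of vectors out
print(R.shape)                    # (4, 512)
```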
Encoder Internals
In an RNN the hidden state vector is passed along the input sequence to store the meaning of the sequence seen so far
At time step t,
hidden state contains the meaning of words that occurred at previous time steps [1,t−1]
at time step t, model incorporates the meaning of the current word into the hidden state
this hidden state is supposed to learn the inter-word dependencies in the input sequence
but in practice it does not capture these dependencies very well
self attention is used to incorporate the understanding of other words at different positions into the encoding of the current word
Self attention in detail
First Step
create three vectors from each of the embedding vectors
key vector (k)
query vector (q)
value vector (v)
these vectors are created by multiplying the embedding vector by three weight matrices that are learned during training
ki = ei × WK
qi = ei × WQ
vi = ei × WV
ei is the embedding vector of 512 dimensions
WK, WQ and WV each have shape 512×64
this means ki, qi and vi each have 64 dimensions
Second Step
calculate attention score for each word against the current word
the attention score of word xj for the word xi is calculated as the dot product between the key vector of xj and the query vector of xi
scoreij=qi⋅kj
scoreij tells you how much attention should be given to the word xj when encoding the word xi
Third Step
divide the attention scores by square root of the dimension of key vector
score_newij = scoreij / √dk
dk is the number of dimensions in the key vector
this is done because the softmax function applied in the next step is sensitive to very large scores
Fourth Step
apply softmax to the scaled scores so that they are all positive and sum to 1, giving softmax_scoreij
Fifth Step
create new value vectors for the word xi by multiplying the value vector vj of each word xj by softmax_scoreij
v_newij=softmax_scoreij⋅vj
think of value vector as information about a word
by multiplying the softmax score with a value vector we are determining how much of that word's information is needed to encode the current word
Sixth Step
Create a new vector zi, which encodes the word xi, by summing the new value vectors created from each word in the sentence
zi = Σj v_newij
zi encodes the understanding gathered from different words at different locations for the current word xi
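putting the six steps together for one word, a minimal numpy sketch (random weights and embeddings, 5 words, dk = 64):

```python
import numpy as np

np.random.seed(0)
d_model, d_k = 512, 64

# step 1: the three trained projection matrices (random here, just for shapes)
W_K = np.random.randn(d_model, d_k) * 0.01
W_Q = np.random.randn(d_model, d_k) * 0.01
W_V = np.random.randn(d_model, d_k) * 0.01

E = np.random.randn(5, d_model)            # embedding vectors e1..e5 of one sentence
K, Q, V = E @ W_K, E @ W_Q, E @ W_V        # ki, qi, vi for every word (each 5 x 64)

i = 1                                      # encode word x_i against all words
scores = Q[i] @ K.T                        # step 2: score_ij = qi . kj  -> shape (5,)
scores = scores / np.sqrt(d_k)             # step 3: divide by sqrt(dk)
weights = np.exp(scores) / np.exp(scores).sum()   # step 4: softmax over the scores
v_new = weights[:, None] * V               # step 5: weight every value vector vj
z_i = v_new.sum(axis=0)                    # step 6: sum them up -> encoding zi of xi
print(z_i.shape)                           # (64,)
```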
Matrix Calculation of Attention
suppose we have a 4×512 matrix X
sentence length is 4
embedding vector has 512 dimensions
we have WK, WQ and WV matrices of 512×64 dimensions
now we'll create key, query and value vectors of all words in a sentence
K = X × WK
Q = X × WQ
V = X × WV
K, Q and V each has dimensions of 4×64
now find attention score of every word by multiplying query vector of a word with all the key vectors
A=Q×KT
A has 4×4 dimensions
each ith row represents the attention scores of all words against the ith word in the sentence
now we want to divide every value by square root of number of dimensions in key vector
A = A / √dk
apply softmax row-wise on matrix A to convert the attention scores into weights (each row holds the scores for one word)
S=softmax(A)
now we find Z matrix containing final encoding of each word by multiplying S by value matrix V
Z has 4×64 dimensions
Z=S×V
all the above operations can be written in one formula
Z = softmax((Q × KT) / √dk) × V
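the same computation in matrix form as a numpy sketch (random X and weight matrices):

```python
import numpy as np

np.random.seed(0)
sentence_len, d_model, d_k = 4, 512, 64

X = np.random.randn(sentence_len, d_model)        # 4 x 512 input matrix
W_K = np.random.randn(d_model, d_k) * 0.01
W_Q = np.random.randn(d_model, d_k) * 0.01
W_V = np.random.randn(d_model, d_k) * 0.01

K, Q, V = X @ W_K, X @ W_Q, X @ W_V               # each 4 x 64

A = Q @ K.T / np.sqrt(d_k)                        # 4 x 4 scaled attention scores
S = np.exp(A) / np.exp(A).sum(axis=-1, keepdims=True)   # softmax over each row
Z = S @ V                                         # 4 x 64 final encodings
print(A.shape, S.shape, Z.shape)                  # (4, 4) (4, 4) (4, 64)
```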
Multiheaded Attention
in single headed attention, the z vector contains a little bit of every word's encoding, but it is possible that an irrelevant word's encoding gets a higher attention score than the relevant ones
For example, consider sentence "plane crashed into the sea"
it is possible that, while encoding the word "crashed", the attention score for "sea" is higher than the score for "plane"
this would mean the model thinks it is the "sea" that crashed and not the "plane", which is meaningless
with multiheaded attention, the possibility of a single word's encoding dominating the z vector is reduced
each head has its own set of randomly initialized WQ, WK and WV weight matrices
each set will project input embeddings into different subspaces
each head will then create different encoding for the same input embeddings
different encodings of the same input are like assigning multiple meanings to a sentence
each encoding will focus on different aspects of the same sentence
in the actual transformer model number of heads is 8 (this could be any other value)
this creates 8 encoding matrices Z1,Z2,...Z8
each Z1,Z2,...,Z8 has dimensions 4×64
but the feed forward network after this attention layer expects a single matrix
so we concatenate all Z matrices to create a single matrix of dimension 4×512 and multiply with another weight matrix WO of dimension 512×512
Z=cat(Z1,Z2,...,Z8)×WO
Z matrix has dimension 4×512
now this Z matrix goes through a feed forward network to create a new R matrix which is then forwarded to the next encoder block
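a sketch of multiheaded attention with 8 heads (all weight matrices random, shapes as above):

```python
import numpy as np

np.random.seed(0)
sentence_len, d_model, n_heads, d_k = 4, 512, 8, 64

X = np.random.randn(sentence_len, d_model)

def attention_head(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    A = Q @ K.T / np.sqrt(K.shape[-1])
    S = np.exp(A) / np.exp(A).sum(axis=-1, keepdims=True)
    return S @ V                                   # 4 x 64 encoding from this head

# each head gets its own randomly initialized W_Q, W_K, W_V of shape 512 x 64
Zs = [attention_head(X,
                     np.random.randn(d_model, d_k) * 0.01,
                     np.random.randn(d_model, d_k) * 0.01,
                     np.random.randn(d_model, d_k) * 0.01)
      for _ in range(n_heads)]

Z_cat = np.concatenate(Zs, axis=-1)                # 4 x 512 (8 heads of 64 dims each)
W_O = np.random.randn(d_model, d_model) * 0.01     # 512 x 512 output projection
Z = Z_cat @ W_O                                    # 4 x 512, fed to the feed forward network
print(Z.shape)                                     # (4, 512)
```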
Efficient Multiheaded Attention
creating multiple heads of attention also comes with multiple sets of WQ, WK and WV weight matrices
suppose k is the number of dimensions in embedding vector of a word
therefore input X will have shape sentence_len×k
a single head in multiheaded attention can have weight matrices of shape k × (k/h)
every head then has 3k²/h parameters
h heads together have h × 3k²/h = 3k² parameters
which is the same as single head attention with weight matrices of shape k × k, because that also has 3k² parameters
each head i produces a Zi matrix of shape sentence_len × (k/h)
concatenating the Zi matrices gives another matrix of shape sentence_len × k
then we pass this concatenated matrix through feed forward network
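a quick numeric check of this parameter-count argument (assuming k = 512 and h = 8):

```python
# quick check of the parameter-count argument above (k = 512, h = 8)
k, h = 512, 8

per_head = 3 * k * (k // h)        # W_Q, W_K, W_V each of shape k x (k/h)
multi_head_total = h * per_head    # all h heads together
single_head_kxk = 3 * k * k        # one head with k x k weight matrices

print(per_head)          # 98304  = 3k^2 / h
print(multi_head_total)  # 786432 = 3k^2
print(single_head_kxk)   # 786432 = 3k^2 -> same total cost
```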
Using Positional Encoding
in LSTMs or classical RNNs the model learns the relative positions of words by itself because sentences are processed sequentially
so while processing a word, the RNN knows at what position the word arrives and which words it has processed before the current one
but in transformers, since we process the whole sentence in parallel, it becomes difficult to learn the relative positions of words within the sentence
that's why positional encodings are used to encode the position of a word into the embedding vector of the word
the positional encoding values follow a specific pattern that the model learns to use for determining the position of each word and the distance between different words
positional encodings are added to the embedding vectors at the bottom most encoder layer
two approaches to initialize positional encodings:
fixed positional encodings: using sine and cosine functions
randomly initialized: positional encodings are learned by model itself but it is more computationally expensive
X′=X+P
P is positional embedding matrix of shape sentence_len×k
X is input matrix of shape sentence_len×k
X′ is modified input matrix which also contains the positional information of each word embedding vector
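a sketch of the fixed sine/cosine variant mentioned above (formula taken from the original paper):

```python
import numpy as np

def positional_encoding(sentence_len, k):
    # fixed sine/cosine encodings from the original paper:
    # P[pos, 2i]   = sin(pos / 10000^(2i/k))
    # P[pos, 2i+1] = cos(pos / 10000^(2i/k))
    pos = np.arange(sentence_len)[:, None]     # positions 0 .. sentence_len-1
    i = np.arange(0, k, 2)[None, :]            # even dimension indices
    angles = pos / np.power(10000.0, i / k)
    P = np.zeros((sentence_len, k))
    P[:, 0::2] = np.sin(angles)
    P[:, 1::2] = np.cos(angles)
    return P

sentence_len, k = 4, 512
X = np.random.randn(sentence_len, k)      # word embedding vectors
P = positional_encoding(sentence_len, k)  # P has shape sentence_len x k
X_prime = X + P                           # X' = X + P now carries position information
print(X_prime.shape)                      # (4, 512)
```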
Residual Connection
encoders and decoders are deep neural networks therefore there's risk of vanishing gradients
residual connection prevents vanishing gradient problem
layer normalization prevents inputs from being too small or too large which in turn improves stability
after each self-attention layer and each feed forward layer there is a residual connection
layer normalization is applied to the output of the residual connection
each decoder block follows a similar structure
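a minimal sketch of the residual connection + layer normalization step; real layer normalization also has a learned scale and bias (omitted here), and a simple ReLU stands in for the sublayer:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # normalize each position's vector to zero mean / unit variance
    # (the learned scale and bias of real layer normalization are omitted)
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def residual_then_norm(x, sublayer):
    # residual connection around the sublayer, then layer normalization
    return layer_norm(x + sublayer(x))

X = np.random.randn(4, 512)
# e.g. wrapped around the feed-forward sublayer (ReLU used as a stand-in here)
out = residual_then_norm(X, lambda x: np.maximum(0.0, x))
print(out.shape)   # (4, 512)
```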
Decoder Internals
Using target mask
during training, we don't want the decoder to know about future tokens in the target sequence
to do this we use a target mask to zero out the attention scores for future tokens
target_mask =
1 0 0
1 1 0
1 1 1
now suppose that we have 3 words in the sentence and we calculate attention score as follows
attention_score =
5 2 1
4 8 3
7 6 9
each ith row represents the attention scores for the ith word
for the first word we should not know the attention scores for the second and third words (i.e. future tokens), so we zero out those values; the same is done for the second and third words
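a small sketch of masking the toy scores above; note that in practice the masked positions are usually set to -inf rather than 0 so that softmax gives them exactly zero weight:

```python
import numpy as np

scores = np.array([[5., 2., 1.],
                   [4., 8., 3.],
                   [7., 6., 9.]])              # the toy attention scores above

mask = np.tril(np.ones((3, 3)))                # the target mask above (lower triangular)

# block out future tokens before softmax; the masked positions get -inf so that
# softmax assigns them exactly zero weight
masked = np.where(mask == 1, scores, -np.inf)
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 3))
# row 1 attends only to word 1, row 2 to words 1-2, row 3 to all three words
```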