In the current era of Generative AI and LLMs, everyone is leveraging the capabilities of SOTA Natural Language Generation, Text Summarization, Code Copilots, etc. But behind the scenes there is an architecture that helps the model understand the context of the input text and then act on it. This architecture was designed by researchers at Google, who introduced the Transformer in the paper "Attention is All You Need".
In this blog we will walk through the research paper and implement it from scratch using PyTorch.
Attention
The attention score is calculated as the dot product between the query and the keys, scaled by the square root of the key dimension. This score indicates how much focus a query should place on each key. Applying a softmax to the scores produces the attention weights, which are used to compute a weighted sum of the values, giving the final output for each head. This allows the model to focus on specific parts of the input based on the attention scores.
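A minimal sketch of scaled dot-product attention as described above; the function name and the (batch, heads, seq_len, d_k) tensor layout are assumptions for illustration:

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value, mask=None):
    # query, key, value: (batch, heads, seq_len, d_k)
    d_k = query.size(-1)
    # Dot product between queries and keys, scaled by sqrt(d_k)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Block attention to masked positions (e.g. padding or future tokens)
        scores = scores.masked_fill(mask == 0, float("-inf"))
    # Softmax over the key dimension yields the attention weights
    weights = F.softmax(scores, dim=-1)
    # Weighted sum of the values is the output for each head
    return torch.matmul(weights, value), weights
```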
Encoder Layer
The encoder layer is a single layer of the Transformer encoder. Each layer processes the input sequence by applying self-attention and a feed-forward neural network (FFNN), then adds the output to the original input via a residual connection and normalizes it. Stacking multiple encoder layers allows the model to build increasingly complex representations: the model applies each layer sequentially, passing the output of one layer as input to the next.
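One way such a layer might look; the hyperparameter names (d_model, num_heads, d_ff) are assumptions, and this sketch uses PyTorch's built-in nn.MultiheadAttention for brevity rather than a hand-rolled attention module:

```python
import torch.nn as nn

class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(
            d_model, num_heads, dropout=dropout, batch_first=True
        )
        # Position-wise feed-forward network
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, src_mask=None):
        # Self-attention, then add & norm (residual connection)
        attn_out, _ = self.self_attn(x, x, x, attn_mask=src_mask)
        x = self.norm1(x + self.dropout(attn_out))
        # Feed-forward, then add & norm
        x = self.norm2(x + self.dropout(self.ffn(x)))
        return x
```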
Decoder Layer
A single TransformerDecoderLayer extends the functionality of TransformerEncoderLayer by adding a 2nd attention mechanism that allows the decoder to attend to the encoder's output. The 1st multi-head attention layer applies masked self-attention to the decoder's input (that is, the previously generated tokens); it helps the model understand the relationships between different parts of the output sequence. The decoder itself is composed of multiple TransformerDecoderLayers stacked on top of each other. Each layer processes the output of the previous layer, allowing the model to build complex representations and generate more accurate outputs.
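A sketch of the decoder layer under the same assumptions as the encoder sketch above: the 1st attention block is masked self-attention over the decoder input, the 2nd is cross-attention over the encoder output (here called memory):

```python
import torch.nn as nn

class TransformerDecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        # 1st attention: masked self-attention over previously generated tokens
        self.self_attn = nn.MultiheadAttention(
            d_model, num_heads, dropout=dropout, batch_first=True
        )
        # 2nd attention: cross-attention over the encoder's output
        self.cross_attn = nn.MultiheadAttention(
            d_model, num_heads, dropout=dropout, batch_first=True
        )
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, memory, tgt_mask=None, memory_mask=None):
        # Masked self-attention on the decoder input, then add & norm
        attn_out, _ = self.self_attn(x, x, x, attn_mask=tgt_mask)
        x = self.norm1(x + self.dropout(attn_out))
        # Cross-attention: queries from the decoder, keys/values from the encoder
        attn_out, _ = self.cross_attn(x, memory, memory, attn_mask=memory_mask)
        x = self.norm2(x + self.dropout(attn_out))
        # Feed-forward, then add & norm
        x = self.norm3(x + self.dropout(self.ffn(x)))
        return x
```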
Combining all layers
Tokenization: The function tokenizes the input text into a sequence of token ids using the vocabulary.
Tensor Preparation: The tokens are converted into a tensor and reshaped to add a batch dimension; the tensor is then ready to be fed into the model.
Positional Encoding: The model ensures that the positional encoding matches the length of the input tensor. This step combines the token embeddings with the positional encodings, adding information about the position of each token (a sketch of the encoding module follows below).
Pass Through the Model: The embedded input is passed through the encoder (and the decoder, if you're using a sequence-to-sequence model). The model generates a probability distribution over the vocabulary for the next token; the sketches below put these steps together.
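The sinusoidal positional encoding from the paper can be precomputed once and added to the embeddings; max_len here is an assumed upper bound on sequence length:

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        # Precompute the sin/cos table from the paper
        position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
        self.register_buffer("pe", pe.unsqueeze(0))  # (1, max_len, d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model); slice the table to the input length
        return x + self.pe[:, : x.size(1)]
```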
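Putting the steps together, a minimal next-token pipeline might look like the following. The tiny word-level vocab, the predict_next_token helper, and the lm_head projection are all hypothetical placeholders, and the sketch reuses the TransformerEncoderLayer and PositionalEncoding modules defined above:

```python
import torch

# Hypothetical word-level vocabulary for illustration
vocab = {"<unk>": 0, "attention": 1, "is": 2, "all": 3, "you": 4, "need": 5}
d_model = 512

embedding = torch.nn.Embedding(len(vocab), d_model)
pos_encoding = PositionalEncoding(d_model)
encoder_layer = TransformerEncoderLayer(d_model, num_heads=8, d_ff=2048)
lm_head = torch.nn.Linear(d_model, len(vocab))  # projects back to vocab logits

def predict_next_token(text):
    # 1. Tokenize the text into ids using the vocabulary
    tokens = [vocab.get(w, vocab["<unk>"]) for w in text.lower().split()]
    # 2. Convert to a tensor and add a batch dimension: (1, seq_len)
    x = torch.tensor(tokens).unsqueeze(0)
    # 3. Embed the tokens and add positional information
    x = pos_encoding(embedding(x))
    # 4. Pass through the model and project to vocabulary logits
    logits = lm_head(encoder_layer(x))
    # Probability distribution over the vocabulary for the next token
    return torch.softmax(logits[0, -1], dim=-1)

probs = predict_next_token("attention is all you")
print(probs.argmax().item())  # most likely next-token id (untrained, so arbitrary)
```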
Reference code can be found here:
Happy Coding!!!