Transforming the landscape of ML using Transformers
By now you must have heard about "Transformers": not the movie franchise, but the machine learning model that forms part of the ChatGPT acronym. The GPT in ChatGPT stands for Generative Pre-trained Transformer. This article is about transformers and how they revolutionized not only the field of Natural Language Processing (NLP) but the whole machine learning landscape. One of the goals is to give you an intuition of what the "attention" blocks in transformers actually achieve. Through this, I hope you also get an intuition for how technologies like ChatGPT are stretching the boundaries of what AI can currently achieve.
The Role of Auto-Encoders
A key idea that has enabled this sudden rise in capability is a class of ML models called auto-encoders. The advantage they bring to the table is that they are a kind of unsupervised (more precisely, self-supervised) ML technique, meaning they do not require each training sample to be paired with a "label" that then becomes the ground truth for teaching the machine. This is because auto-encoders are designed to learn a model that can reproduce the input as the output. Think of a CNN that learns to accurately regenerate the input image at its output: each pixel of the input image then becomes the label for that pixel. Such architectures have two parts: an encoder, which forms a bottleneck by distilling the input into a lower-dimensional space, and a decoder, which tries to generate the final output from the encoder's output. If trained successfully, we expect the weights of the network to store information about the underlying process that generated the input data. In the context of an NLP model, the input can be a sentence and the task can be to predict each subsequent word from the ones preceding it.

The fact that auto-encoders can learn from unlabelled data has a huge implication: it saves machine learning engineers from spending countless hours preparing labelled data. Instead, since the internet is largely a huge corpus of text, they have at their disposal a gold mine of training data. The only deciding factor now is who has the most resources to train networks on such huge datasets.
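To make the encoder-decoder idea concrete, here is a minimal sketch of an auto-encoder in PyTorch. The sizes are toy assumptions (say, 28x28 grayscale images flattened into 784-dimensional vectors) rather than anything tied to a real dataset:

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, input_dim=784, bottleneck_dim=32):
        super().__init__()
        # Encoder: distill the input into a lower-dimensional bottleneck.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128),
            nn.ReLU(),
            nn.Linear(128, bottleneck_dim),
        )
        # Decoder: reconstruct the input from the bottleneck.
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck_dim, 128),
            nn.ReLU(),
            nn.Linear(128, input_dim),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder()
x = torch.rand(16, 784)            # a batch of 16 fake "images"
loss = nn.MSELoss()(model(x), x)   # the input itself acts as the label
```

Note how the loss compares the output against the input itself; no external labels are involved.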
Does ChatGPT really understand human language?
ChatGPT is a product of research in Natural Language Processing (NLP), a subdomain of Artificial Intelligence. The term "natural language" exists because there are also "programming languages": man-made languages designed for humans to communicate with machines. Let's look at some of the tasks that NLP models are trained to perform:
- Classifying whether a movie review is positive or negative - this is called sentiment analysis (a quick code sketch follows this list).
- Summarizing a news article in a few words - text summarization.
- Translating a sentence from English to French - machine translation.
- What should be the next word in this (incomplete) sentence? - text generation.
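To give a taste of what these tasks look like in practice, here is a quick sketch of the first one (sentiment analysis) using the Hugging Face transformers library, which downloads a small default pretrained model on first run; the exact score you get may differ:

```python
from transformers import pipeline

# The "sentiment-analysis" pipeline loads a default pretrained classifier.
classifier = pipeline("sentiment-analysis")

print(classifier("A stunning film, easily the best of the year."))
# Something like: [{'label': 'POSITIVE', 'score': 0.9998}]
```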
ChatGPT itself is called a Large Language Model, or LLM for short. Although LLMs today are capable of much more, it is helpful to focus on one of their abilities: text generation. Let's think of ChatGPT and similar chatbots as systems that can predict the next word given a seed word or sentence. Once we have a system that can do this, we can feed the output, along with the initial prompt, back to the input and generate the next word in the sequence, and so on. This is essentially what happens under the hood of ChatGPT. Hence it is misleading to say that such models understand all the nuances and complexities of human languages. But they are, indeed, becoming better and better at the specific task they are designed for.
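Here is a rough sketch of that feed-the-output-back loop, using GPT-2 from the Hugging Face transformers library as a small stand-in for ChatGPT's much larger model. Real chatbots sample from the predicted distribution rather than always taking the single most likely token, so the greedy argmax here is a simplification:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer("The transformer architecture", return_tensors="pt").input_ids
for _ in range(10):                      # generate 10 more tokens
    logits = model(ids).logits           # a score for every vocabulary token
    next_id = logits[:, -1].argmax(-1)   # greedily pick the most likely one
    ids = torch.cat([ids, next_id.unsqueeze(-1)], dim=-1)  # feed it back in

print(tokenizer.decode(ids[0]))
```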
But to make possible all those amazing things you have seen it do, it leverages the power of the "transformer" network together with the capacity of a model with over a hundred billion parameters (GPT-3, on which ChatGPT was built, has 175 billion).
Understanding Transformers
Before transformers, the dominant models were LSTMs and GRUs, which were refinements of the Recurrent Neural Network (RNN). These were notoriously difficult to train, especially on large datasets, because of the sequential nature of their algorithms, which meant they could not take advantage of the massive parallelism of modern GPUs. One of the contributions of the original paper that introduced transformers was the "attention" block, which alleviates this problem: once trained, it can produce outputs in parallel for every token in an input sentence. To understand transformers we need to recognize what they were designed for. The transformer architecture was designed for machine translation; in the terminology of NLP, it is a kind of sequence-to-sequence model. It employs the encoder-decoder idea we talked about earlier, but instead of using LSTMs or the like for the encoder and decoder it uses "attention" blocks and MLPs, along with the usual mix of residual connections and normalization layers (see Figure 1).
*Figure 1. A sketch of the Transformer encoder block*
In the process of learning to reproduce the input, auto-encoders learn a hidden representation of the input (feature) space, called an embedding (vector). Thus, after training, the transformer encoder has learnt a useful representation of the input language, which the decoder then uses to do the reverse: convert the embeddings back into a sentence, this time in another language. At the very least, it is helpful to learn how the transformer encoder works, since the encoder alone can be reused for other tasks such as text classification.
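To make Figure 1 a bit more concrete, here is a rough sketch of one encoder block in PyTorch. The layer sizes are toy assumptions, and PyTorch's built-in multi-head attention layer stands in for the attention block we unpack below:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=64, n_heads=4, d_ff=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Attention sub-layer with a residual connection, then normalize.
        x = self.norm1(x + self.attn(x, x, x)[0])
        # MLP sub-layer with a residual connection, then normalize.
        return self.norm2(x + self.mlp(x))

x = torch.rand(2, 10, 64)   # (batch, tokens, embedding dimension)
y = EncoderBlock()(x)       # same shape out: contextual embeddings
```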
The Attention Block
What made existing models slow, before transformers came into the picture, was that they processed each input token in a sentence sequentially while updating a state variable. You can think of this state variable as the memory of the neural network. The transformer bypasses the need for sequential computation with the "attention" block, which you can think of as a smaller neural network that learns three weight matrices from the data, called Query (Q), Key (K) and Value (V) for reasons that might become clear later on. Multiplying the Query matrix by the transpose of the Key matrix produces a score matrix in which each element shows how much one word (token) should pay attention to another word (token). You can think of "attention" as asking the following question for each word in a sentence: which other words in this sentence are important for understanding this word? That is, how related is each word to every other word? The key insight is that, even though the attention mechanism computes a score for every pair of tokens in a sentence, it achieves parallelism through matrix operations, as sketched below.
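Here is a minimal sketch of that computation for a single attention head, assuming a toy sentence of 5 tokens that has already been embedded as 16-dimensional vectors; the division by the square root of the dimension is the scaling used in the original paper:

```python
import torch
import torch.nn.functional as F

seq_len, d_model = 5, 16
x = torch.rand(seq_len, d_model)    # one embedding vector per token

# The three weight matrices (learned from data in a real model; random here).
W_q = torch.rand(d_model, d_model)
W_k = torch.rand(d_model, d_model)
W_v = torch.rand(d_model, d_model)

Q, K, V = x @ W_q, x @ W_k, x @ W_v
# Score matrix: entry (i, j) says how much token i attends to token j.
scores = Q @ K.T / d_model ** 0.5   # scaled dot-product
weights = F.softmax(scores, dim=-1) # each row becomes a probability dist.
out = weights @ V                   # all tokens processed in one matmul
```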
We will build on this intuition in a future article, which will look at how the "multi-head attention" block differs from what we just learnt and explore the other components, like the MLP layers, that together complete the transformer encoder block.
References:
- Chollet, F. (2021). Deep learning with Python (Second edition). Manning.
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems 30.
- Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. International Conference on Learning Representations (ICLR).
- But what is a GPT? Visual intro to transformers | Chapter 5, Deep Learning - https://youtu.be/wjZofJX0v4M
- Attention in transformers, visually explained | Chapter 6, Deep Learning - https://youtu.be/eMlx5fFNoYc
