Quick review — GPT family

Sam Shamsan
2 min read · May 8, 2021


GPT is an autoregressive model that leverages the rich linguistic information in large amounts of unlabeled data and uses the learned representations to tackle many downstream tasks. Training has two stages, generative and discriminative. The generative stage performs unsupervised (self-supervised) learning on the unlabeled data to learn a representation, and the discriminative stage is where we fine-tune the model for specific tasks.
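The generative stage maximizes the likelihood of each token given the tokens that precede it. A rough PyTorch sketch of that next-word-prediction loss, where `model` is a hypothetical decoder returning per-position vocabulary logits (not the paper's actual code):

```python
import torch.nn.functional as F

def lm_loss(model, token_ids):
    # Next-word prediction: inputs and targets are shifted by one position.
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]
    logits = model(inputs)  # assumed shape: (batch, seq_len - 1, vocab_size)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten batch and time steps
        targets.reshape(-1),
    )
```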

Unlike BERT, which is trained with a masked-language-model objective, GPT is trained with a next-word-prediction language-modeling objective.
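Next-word prediction only works if each position can attend to earlier positions but never to later ones. A minimal sketch of such a causal (lower-triangular) attention mask:

```python
import torch

def causal_mask(seq_len):
    # Lower-triangular mask: position i may attend only to positions <= i,
    # so the model never sees future words during next-word prediction.
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

print(causal_mask(4))
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])
```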

GPT is decoder-based, so there is no bidirectional attention: causal masking in the self-attention layers keeps the model blind to future words. To fine-tune GPT, we present labeled data and the objective becomes predicting the class of a given sequence. This is done by combining two losses, one from the language model and one from the task itself, with a lambda parameter that sets the weight of the LM loss.
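A sketch of that combined objective, assuming a hypothetical `model` that returns both sequence-level class logits and per-token LM logits (the default value of `lam` is an assumption, not a quoted hyperparameter):

```python
import torch.nn.functional as F

def finetune_loss(model, token_ids, labels, lam=0.5):
    # `model` is assumed to return a class prediction for the whole
    # sequence plus per-token LM logits for the auxiliary objective.
    class_logits, lm_logits = model(token_ids[:, :-1])
    task_loss = F.cross_entropy(class_logits, labels)
    aux_lm_loss = F.cross_entropy(
        lm_logits.reshape(-1, lm_logits.size(-1)),
        token_ids[:, 1:].reshape(-1),
    )
    return task_loss + lam * aux_lm_loss  # lambda weights the LM term
```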

Fine-tuning is very similar to BERT's: the input sequence is arranged in a task-specific format. For instance, a delimiter token is placed between the question and the paragraph for question answering, while token-tagging tasks can simply use the sentence embeddings. On top of the final layer, a linear layer is added to handle the task-specific classification.
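For example, a question-answering input can be packed with special tokens and classified through a small linear head. The helper and token names below are illustrative, not the paper's exact API:

```python
import torch.nn as nn

def pack_qa_input(start_id, question_ids, delim_id, paragraph_ids, extract_id):
    # Hypothetical helper: wrap a question/paragraph pair with special
    # start, delimiter, and extract tokens before feeding the decoder.
    return [start_id] + question_ids + [delim_id] + paragraph_ids + [extract_id]

class TaskHead(nn.Module):
    """Task-specific linear layer on top of the transformer's final
    hidden state (typically taken at the extract token)."""
    def __init__(self, hidden_dim, num_classes):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, num_classes)

    def forward(self, final_hidden):  # shape: (batch, hidden_dim)
        return self.linear(final_hidden)
```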

The training data used for GPT differs from BERT's: GPT was trained only on BookCorpus, with over 1B words. In their experiments, the authors found that the more layers they added to the LM, the better the accuracy.

One of their goals was to optimize next-sentence prediction in a multi-task fashion, where the input contains the correct continuation along with a few distractor options. This was also applied to conversational systems, where a dialog turn is treated as a sequence of inputs that can be handled with next-sentence prediction. In this approach, the word embedding, dialog-state embedding, and positional embedding are combined into a unified token embedding.
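A minimal sketch of that unified input embedding, assuming the three embeddings are simply summed (the dimension names and the summation are assumptions for illustration):

```python
import torch
import torch.nn as nn

class DialogInputEmbedding(nn.Module):
    """Combine word, dialog-state, and position embeddings into one
    input embedding per token (here by summing them)."""
    def __init__(self, vocab_size, num_states, max_len, dim):
        super().__init__()
        self.word = nn.Embedding(vocab_size, dim)
        self.state = nn.Embedding(num_states, dim)  # e.g. speaker / segment id
        self.pos = nn.Embedding(max_len, dim)

    def forward(self, token_ids, state_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.word(token_ids) + self.state(state_ids) + self.pos(positions)
```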
