Quick review — DeBERTa

Sam Shamsan
1 min read · May 8, 2021

--

Decoding enhanced BERT with disentangled attention

The main motivation behind DeBERTa is that attention weights should depend not only on a token's content but also on its position. Instead of adding the content and position embeddings together, the approach here is to keep them separate. This makes it possible to compute cross attention in more flexible and diverse ways: content-to-content, content-to-position, and position-to-content.

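To make that decomposition concrete, here is a minimal NumPy sketch (not the paper's implementation). `H` holds the content vectors, `P` holds a relative-position embedding for each query/key pair, and the projection matrix names are illustrative only; the attention score for a pair is the sum of the three terms listed above.

```python
import numpy as np

def disentangled_attention_scores(H, P, Wq_c, Wk_c, Wq_r, Wk_r):
    """Toy sketch of disentangled attention scores.

    H: (n, d) content hidden states.
    P: (n, n, d) relative-position embeddings; P[i, j] stands for the
       embedding of the relative distance between positions i and j
       (the paper uses a shared table indexed by that distance).
    The W_* matrices are (d, d) projections with made-up names.
    """
    Qc, Kc = H @ Wq_c, H @ Wk_c          # content queries / keys
    Qr, Kr = P @ Wq_r, P @ Wk_r          # relative-position queries / keys

    c2c = Qc @ Kc.T                                # content-to-content
    c2p = np.einsum('id,ijd->ij', Qc, Kr)          # content-to-position
    p2c = np.einsum('jd,ijd->ij', Kc, Qr)          # position-to-content
    # (the paper indexes the relative distance as delta(j, i) in the
    #  p2c term; simplified here for readability)

    d = H.shape[-1]
    return (c2c + c2p + p2c) / np.sqrt(3 * d)      # scaled as in the paper

# tiny usage example with random inputs
n, d = 4, 8
rng = np.random.default_rng(0)
H = rng.normal(size=(n, d))
P = rng.normal(size=(n, n, d))
W = [0.1 * rng.normal(size=(d, d)) for _ in range(4)]
scores = disentangled_attention_scores(H, P, *W)   # shape (n, n)
```
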
With 48 Transformer layers and 1.5B parameters, DeBERTa builds on BERT and RoBERTa with three major enhancements:

  • Disentangled attention, where each word is represented by two separate vectors, one for its content and one for its position, so the attention weights are computed from disentangled matrices over content and relative position.
  • An enhanced mask decoder, which adds absolute position embeddings at the decoder level. This comes from the fact that some position information is important when predicting masked tokens, so the mask decoder is conditioned not only on the token embeddings but also on their position embeddings.
  • During fine-tuning, they use a virtual adversarial training method (which had not been widely tested at that point) intended to improve the model's generalization: they first normalize the word embedding vectors, then add small perturbations to them and train on the perturbed embeddings (a rough sketch follows this list).

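As an illustration of the third bullet, here is a minimal PyTorch-style sketch of the idea, not the authors' actual method: the `eps` scale and the random (rather than truly adversarial) noise are stand-ins.

```python
import torch

def perturbed_embeddings(emb, eps=1e-2):
    """Sketch of the fine-tuning idea above: normalize the word
    embeddings, then add a small perturbation before the forward pass.
    The real method computes an adversarial perturbation; random noise
    is used here only to keep the sketch short.
    """
    # normalize first so the perturbation scale does not depend on the
    # magnitude of the embedding vectors
    normed = torch.nn.functional.layer_norm(emb, emb.shape[-1:])
    noise = torch.randn_like(normed)
    noise = eps * noise / noise.norm(dim=-1, keepdim=True)
    return normed + noise
```
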
The model was pre-trained on four corpora, including Wikipedia and BookCorpus, and compared against BERT and RoBERTa, showing improvements over both on the SuperGLUE benchmark.
