Quick review — ELECTRA

Sam Shamsan
May 8, 2021

--

ELECTRA is a more recent model than BERT and BART. The motivation behind it is a modification of the masking process: instead of randomly masking tokens and predicting them, tokens are replaced with plausible alternatives produced by a generator network. In BERT-style pretraining, the artificial [MASK] tokens never appear in downstream data, and the model only learns from the roughly 15% of positions that are masked, so much of each input goes unused. ELECTRA's replaced-token-detection objective is more sample-efficient because the loss is defined over every input token rather than only the masked subset. Concretely, a generator is trained as a masked language model to fill the masked positions with plausible words; the corrupted sequence is then fed to a discriminator, which predicts for each token whether it is the original or a replacement. Both networks are trained jointly, and both are encoder-based Transformers. ELECTRA-Small and -Base were pretrained on the same data as BERT (about 3.3B tokens from Wikipedia and BooksCorpus), while ELECTRA-Large used roughly ten times as much data.
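To make the training loop concrete, here is a minimal PyTorch sketch of replaced-token detection. The tiny encoder sizes, the random toy batch, and the hyperparameters are illustrative assumptions, not the paper's configuration; only the structure of the objective (generator MLM loss plus discriminator loss over all tokens) follows the description above.

```python
# Minimal sketch of ELECTRA-style pretraining (toy sizes, random data).
import torch
import torch.nn as nn

VOCAB, DIM, MASK_ID = 1000, 64, 1
MASK_PROB = 0.15

class TinyEncoder(nn.Module):
    """A small Transformer encoder with a per-token output head."""
    def __init__(self, out_dim):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, DIM)
        layer = nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True)
        self.enc = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, out_dim)

    def forward(self, ids):
        return self.head(self.enc(self.emb(ids)))

generator = TinyEncoder(out_dim=VOCAB)      # MLM head: predicts token ids
discriminator = TinyEncoder(out_dim=1)      # RTD head: original vs. replaced
opt = torch.optim.Adam(list(generator.parameters()) +
                       list(discriminator.parameters()), lr=1e-4)

tokens = torch.randint(2, VOCAB, (8, 32))   # toy batch: 8 sequences of 32 ids

# 1) Mask ~15% of positions and train the generator with an MLM loss.
mask = torch.rand(tokens.shape) < MASK_PROB
masked_input = tokens.masked_fill(mask, MASK_ID)
gen_logits = generator(masked_input)
mlm_loss = nn.functional.cross_entropy(gen_logits[mask], tokens[mask])

# 2) Sample plausible replacements from the generator's output distribution
#    (sampling is discrete, so no gradient flows back through this step).
with torch.no_grad():
    sampled = torch.distributions.Categorical(logits=gen_logits).sample()
corrupted = torch.where(mask, sampled, tokens)

# 3) The discriminator labels EVERY token as original (0) or replaced (1),
#    so its loss covers all positions, not just the masked subset.
#    If the generator happens to sample the original word, it counts as original.
is_replaced = (corrupted != tokens).float()
disc_logits = discriminator(corrupted).squeeze(-1)
rtd_loss = nn.functional.binary_cross_entropy_with_logits(disc_logits, is_replaced)

# 4) Joint objective: MLM loss plus a weighted replaced-token-detection loss.
opt.zero_grad()
loss = mlm_loss + 50.0 * rtd_loss           # 50 is the weight used in the paper
loss.backward()
opt.step()
```

After pretraining, the generator is discarded and only the discriminator is fine-tuned on downstream tasks.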

--
