Quick review — ELECTRA

Sam Shamsan
May 8, 2021

--

ELECTRA is a more recent model than BERT and BART. The motivation behind it is a modification of the masking process: instead of randomly masking tokens and predicting them, tokens are replaced with plausible alternatives produced by a generator network. In BERT-style pretraining, the artificial [MASK] tokens never appear in downstream data, and the model only learns from the roughly 15% of positions that are masked, so much of each input goes unused. ELECTRA's replaced-token-detection objective is more sample-efficient because the loss is defined over every input token rather than only the masked subset. Concretely, a generator is trained as a masked language model to fill the masked positions with plausible words; the corrupted sequence is then fed to a discriminator, which predicts for each token whether it is the original or a replacement. Both networks are trained jointly, and both are encoder-based Transformers. ELECTRA-Small and -Base were pretrained on the same data as BERT (about 3.3B tokens from Wikipedia and BooksCorpus), while ELECTRA-Large used roughly ten times as much data.
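To make the training loop concrete, here is a minimal PyTorch sketch of replaced-token detection. The tiny encoder sizes, the random toy batch, and the hyperparameters are illustrative assumptions, not the paper's configuration; only the structure of the objective (generator MLM loss plus discriminator loss over all tokens) follows the description above.

```python
# Minimal sketch of ELECTRA-style pretraining (toy sizes, random data).
import torch
import torch.nn as nn

VOCAB, DIM, MASK_ID = 1000, 64, 1
MASK_PROB = 0.15

class TinyEncoder(nn.Module):
    """A small Transformer encoder with a per-token output head."""
    def __init__(self, out_dim):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, DIM)
        layer = nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True)
        self.enc = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, out_dim)

    def forward(self, ids):
        return self.head(self.enc(self.emb(ids)))

generator = TinyEncoder(out_dim=VOCAB)      # MLM head: predicts token ids
discriminator = TinyEncoder(out_dim=1)      # RTD head: original vs. replaced
opt = torch.optim.Adam(list(generator.parameters()) +
                       list(discriminator.parameters()), lr=1e-4)

tokens = torch.randint(2, VOCAB, (8, 32))   # toy batch: 8 sequences of 32 ids

# 1) Mask ~15% of positions and train the generator with an MLM loss.
mask = torch.rand(tokens.shape) < MASK_PROB
masked_input = tokens.masked_fill(mask, MASK_ID)
gen_logits = generator(masked_input)
mlm_loss = nn.functional.cross_entropy(gen_logits[mask], tokens[mask])

# 2) Sample plausible replacements from the generator's output distribution
#    (sampling is discrete, so no gradient flows back through this step).
with torch.no_grad():
    sampled = torch.distributions.Categorical(logits=gen_logits).sample()
corrupted = torch.where(mask, sampled, tokens)

# 3) The discriminator labels EVERY token as original (0) or replaced (1),
#    so its loss covers all positions, not just the masked subset.
#    If the generator happens to sample the original word, it counts as original.
is_replaced = (corrupted != tokens).float()
disc_logits = discriminator(corrupted).squeeze(-1)
rtd_loss = nn.functional.binary_cross_entropy_with_logits(disc_logits, is_replaced)

# 4) Joint objective: MLM loss plus a weighted replaced-token-detection loss.
opt.zero_grad()
loss = mlm_loss + 50.0 * rtd_loss           # 50 is the weight used in the paper
loss.backward()
opt.step()
```

After pretraining, the generator is discarded and only the discriminator is fine-tuned on downstream tasks.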

--
