Quick review — Self attention

Sam Shamsan
1 min read · May 8, 2021


The first layer of the encoder architecture in Transformers is the self-attention layer. This layer takes the word embeddings (from word2vec or any other embedding method) added to positional embeddings, which encode word order so the model can account for the distance between words.
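As a minimal sketch of this input step, assuming the sinusoidal positional encoding from the original Transformer paper and random vectors standing in for word2vec embeddings:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding (Vaswani et al., 2017)."""
    pos = np.arange(seq_len)[:, None]                  # (seq_len, 1) word positions
    i = np.arange(d_model)[None, :]                    # (1, d_model) dimension indices
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])               # even dimensions use sine
    pe[:, 1::2] = np.cos(angle[:, 1::2])               # odd dimensions use cosine
    return pe

seq_len, d_model = 6, 512
word_embeddings = np.random.randn(seq_len, d_model)    # stand-in for word2vec vectors
x = word_embeddings + positional_encoding(seq_len, d_model)  # input to self-attention
```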

The input sequence enters the self-attention layer in parallel for all words; each word's embedding is then multiplied by fixed-shape weight matrices to produce a query vector Q, a key vector K, and a value vector V.

Each word embedding has 512 dimensions, and the projections produce 64-dimensional Q, K, and V vectors.
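A rough sketch of these projections, with randomly initialized NumPy matrices standing in for the learned weights:

```python
import numpy as np

seq_len, d_model, d_k = 6, 512, 64
x = np.random.randn(seq_len, d_model)        # embeddings + positional encodings (as above)

# One set of projection matrices; learned during training, random stand-ins here.
W_q = np.random.randn(d_model, d_k) * 0.01
W_k = np.random.randn(d_model, d_k) * 0.01
W_v = np.random.randn(d_model, d_k) * 0.01

Q = x @ W_q   # (seq_len, 64) query vectors
K = x @ W_k   # (seq_len, 64) key vectors
V = x @ W_v   # (seq_len, 64) value vectors
```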

Then every query is multiplied by the transposed key matrix, which collects the keys of all words in the sequence. This is the step where self-attention starts to be calculated.

The result of each multiplication q_i · k_j is divided by 8, the square root of the query/key dimension (√64). A softmax is then applied to each row of scaled scores. Finally, the softmax weights are multiplied by the value vectors V and summed to get z, the self-attention output vector for that specific word.
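Putting the scoring, scaling, softmax, and value weighting together, here is a minimal NumPy sketch; the random Q, K, V stand in for the projected vectors from the previous step:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # q_i · k_j / 8 when d_k = 64
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # z: one 64-dim vector per word

seq_len, d_k = 6, 64
Q = np.random.randn(seq_len, d_k)   # stand-ins for the projected queries, keys, values
K = np.random.randn(seq_len, d_k)
V = np.random.randn(seq_len, d_k)
Z = scaled_dot_product_attention(Q, K, V)   # (seq_len, 64)
```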

If this process is applied with a single set of weight matrices, we call it a single head. To capture more meaningful representations we create more heads by instantiating additional weight matrices and initializing them at random. We then concatenate the Z vectors of all heads for each word and multiply the result by another weight matrix before feeding it to the feed-forward neural network (FFN).
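A toy multi-head sketch under the same assumptions (8 heads of 64 dimensions each, random NumPy weights); real implementations compute all heads in one batched matrix multiplication rather than a Python loop:

```python
import numpy as np

seq_len, d_model, num_heads = 6, 512, 8
d_k = d_model // num_heads                       # 64 dimensions per head
x = np.random.randn(seq_len, d_model)            # embeddings + positional encodings

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

heads = []
for _ in range(num_heads):
    # Each head gets its own randomly initialized projection matrices.
    W_q = np.random.randn(d_model, d_k) * 0.01
    W_k = np.random.randn(d_model, d_k) * 0.01
    W_v = np.random.randn(d_model, d_k) * 0.01
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    z = softmax(Q @ K.T / np.sqrt(d_k)) @ V      # (seq_len, 64) output of this head
    heads.append(z)

W_o = np.random.randn(num_heads * d_k, d_model) * 0.01   # output projection
multi_head_out = np.concatenate(heads, axis=-1) @ W_o    # (seq_len, 512), fed to the FFN
```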
