Transformers process all the tokens of a text input in parallel; unlike sequential models such as RNNs, they have no built-in notion of order and effectively see the input as an unordered set of tokens. This becomes a problem when we calculate attention for a sentence: tokens that are identical but appear at different positions would receive identical representations, so positional information has to be injected explicitly, typically by adding a positional encoding to each token embedding.
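As a minimal sketch of one common remedy, the sinusoidal positional encodings from the original Transformer paper assign each position a vector of sines and cosines at different frequencies, so that identical tokens at different positions end up with distinct inputs. The function name below is illustrative, not from any particular library:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix of sinusoidal position encodings.

    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    """
    positions = np.arange(seq_len)[:, np.newaxis]    # shape (seq_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]         # shape (1, d_model)
    # Each pair of dimensions (2i, 2i+1) shares one frequency.
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                 # shape (seq_len, d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])            # even dimensions: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])            # odd dimensions: cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=16)
```

In practice this matrix is simply added to the token embeddings before the first attention layer, so two occurrences of the same word at positions 3 and 40 enter attention with different vectors.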