Would be interesting to combine it with Reasoning In the Latent Space: feed the vector from the output layer of the transformer back to the input. Obviously, you can't do it in pre-training. But you can add it later as an optional 'extra' vector, I think. E.g. `input_embedding + MLP(prev_