I'm a little confused of what retnet does in practice. Because in the formula Rentention(X) = (Q @ K.T * D) @ V, if the decay is 1, the mathematical derivation of proving the equivalence between RNN and the Retnet's transformer still works. As when decay is equal to 1, D will be the normal attention mask used by almost all existing GPT models. Does that mean all existing GPT models can be modified into Retnet by simply modifying the inference function without any further training? Am I correct or do I miss something?
I'm a little confused of what retnet does in practice. Because in the formula
Rentention(X) = (Q @ K.T * D) @ V, if the decay is 1, the mathematical derivation of proving the equivalence between RNN and the Retnet's transformer still works. As when decay is equal to 1, D will be the normal attention mask used by almost all existing GPT models. Does that mean all existing GPT models can be modified into Retnet by simply modifying the inference function without any further training? Am I correct or do I miss something?