Is RetNet equivalent to an ordinary GPT when the decay is set to 1? #37

I'm a little confused about what RetNet does in practice. In the formula `Retention(X) = (Q @ K.T * D) @ V`, if the *decay* is 1, the mathematical derivation that proves the equivalence between the RNN form and RetNet's transformer (parallel) form still works. And when *decay* equals 1, *D* becomes the normal causal attention mask used by almost all existing GPT models. Does that mean all existing GPT models can be converted into RetNet simply by modifying the inference function, without any further training? Am I correct, or am I missing something?
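To make the question concrete, here is a minimal pure-Python sketch of the simplified formula quoted above. It is not taken from the RetNet code base: the helper names (`decay_mask`, `retention_parallel`, `retention_recurrent`, `softmax_attention`) and the random test data are assumptions made for illustration. It checks that the parallel and recurrent forms agree when the decay is 1, and it also computes ordinary causal softmax attention on the same inputs, since a standard GPT layer applies a softmax that the retention formula does not.

```python
import math
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def decay_mask(n, gamma):
    # D[i][j] = gamma ** (i - j) for i >= j, else 0.
    # With gamma == 1 this is a lower-triangular matrix of ones.
    return [[gamma ** (i - j) if i >= j else 0.0 for j in range(n)] for i in range(n)]

def retention_parallel(q, k, v, gamma):
    # Simplified parallel form: (Q @ K.T * D) @ V
    n = len(q)
    scores = matmul(q, transpose(k))
    d = decay_mask(n, gamma)
    masked = [[scores[i][j] * d[i][j] for j in range(n)] for i in range(n)]
    return matmul(masked, v)

def retention_recurrent(q, k, v, gamma):
    # Recurrent form: S_n = gamma * S_{n-1} + k_n^T v_n, output_n = q_n @ S_n
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    out = []
    for q_n, k_n, v_n in zip(q, k, v):
        state = [[gamma * state[a][b] + k_n[a] * v_n[b] for b in range(d_v)]
                 for a in range(d_k)]
        out.append([sum(q_n[a] * state[a][b] for a in range(d_k)) for b in range(d_v)])
    return out

def softmax_attention(q, k, v):
    # Ordinary causal softmax attention, for comparison only.
    d_k, d_v = len(q[0]), len(v[0])
    scores = matmul(q, transpose(k))
    out = []
    for i in range(len(q)):
        row = [scores[i][j] / math.sqrt(d_k) for j in range(i + 1)]
        m = max(row)
        w = [math.exp(s - m) for s in row]
        z = sum(w)
        out.append([sum(w[j] / z * v[j][b] for j in range(i + 1)) for b in range(d_v)])
    return out

def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

random.seed(0)
n, d = 5, 4  # hypothetical sequence length and head dimension
q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
k = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
v = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]

par = retention_parallel(q, k, v, 1.0)
rec = retention_recurrent(q, k, v, 1.0)
att = softmax_attention(q, k, v)

print("parallel vs recurrent:", max_abs_diff(par, rec))
print("retention vs softmax attention:", max_abs_diff(par, att))
```

The first difference should be zero up to floating-point error, which matches the derivation mentioned above. The second is generally not zero, because softmax normalizes each row of scores while retention uses the raw masked scores. The sketch covers only the simplified formula, not a full RetNet layer.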
