Skip to content

Is Retnet equivalent to ordinary GPT when the decay is set to 1 ? #37

@xuanyaoming

Description

@xuanyaoming

I'm a little confused of what retnet does in practice. Because in the formula Rentention(X) = (Q @ K.T * D) @ V, if the decay is 1, the mathematical derivation of proving the equivalence between RNN and the Retnet's transformer still works. As when decay is equal to 1, D will be the normal attention mask used by almost all existing GPT models. Does that mean all existing GPT models can be modified into Retnet by simply modifying the inference function without any further training? Am I correct or do I miss something?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions