Problem and Evaluation:
The change introduces a skip_sliding_mask flag to bypass the automatic sliding-window constraint under LOCAL_SLIDING attention. However, mask selection is now split between _transformer.py and _modules.py, which makes the control flow less explicit. The transformer layer pre-selects either inputs.attention_mask or inputs.sliding_attention_mask and, at the same time, sets skip_sliding_mask=True, so correctness depends on both modules agreeing about which mask has already been applied. That makes the effective attention mask harder to reason about, especially as further attention variants or mask types are added, and it can lead to subtle mismatches between intended and actual masking behavior.
Proposed Solution:
A cleaner and more robust approach is to centralize all sliding-window behavior in the attention module, so that the transformer passes only the raw mask inputs and an explicit intent flag. Concretely, instead of pre-switching masks in _transformer.py, always pass both attention_mask and sliding_attention_mask together with a clearly named flag such as disable_sliding_window, and let _modules.py compose the final mask. This keeps the masking logic in one place, removes duplication, and makes the attention behavior easier to reason about and to extend safely.
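
To make the suggestion concrete, here is a rough sketch of what a single composition point could look like. It is illustrative only: compose_attention_mask, is_local_sliding, and sliding_window_mask are hypothetical names, masks are modeled as plain nested lists of booleans rather than tensors, and combining the two masks with a logical AND is an assumption about the intended semantics, not a description of the existing code in _modules.py.

```python
from typing import List, Optional

# mask[q][k] is True where query position q may attend to key position k.
Mask = List[List[bool]]


def sliding_window_mask(seq_len: int, window: int) -> Mask:
    """Causal mask limited to the most recent `window` positions (self included)."""
    return [[0 <= q - k < window for k in range(seq_len)] for q in range(seq_len)]


def compose_attention_mask(
    attention_mask: Mask,
    sliding_attention_mask: Optional[Mask],
    is_local_sliding: bool,
    disable_sliding_window: bool = False,
) -> Mask:
    """Single place that decides the effective mask (hypothetical helper).

    The caller always passes both raw masks plus its intent; it never
    pre-selects one of them.
    """
    # Non-sliding layers, or an explicit opt-out, use the base mask unchanged.
    if not is_local_sliding or disable_sliding_window:
        return attention_mask
    if sliding_attention_mask is None:
        raise ValueError("sliding_attention_mask is required for local sliding attention")
    # Assumption: the effective mask is the intersection of both constraints.
    return [
        [a and s for a, s in zip(row_a, row_s)]
        for row_a, row_s in zip(attention_mask, sliding_attention_mask)
    ]


# Usage: the transformer layer forwards both masks and the flag as-is.
seq_len = 4
causal = [[k <= q for k in range(seq_len)] for q in range(seq_len)]
sliding = sliding_window_mask(seq_len, window=2)

local_mask = compose_attention_mask(causal, sliding, is_local_sliding=True)
bypassed_mask = compose_attention_mask(
    causal, sliding, is_local_sliding=True, disable_sliding_window=True
)
```

With this shape, the call site in _transformer.py would no longer need to know which mask ends up being applied; it would only state whether the sliding window should be disabled.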